Do you remember imagining yourself as an archaeologist at the beginning of this course? You stood in front of an ancient riverbed at dawn, excited at the possibilities of what you might find under the rock. What stories would you uncover? What long-lost mysteries might you reveal? After learning the content in this course, hopefully you feel the same level of excitement our archaeologist feels when it comes to EDA and visualizations. As you've learned, the six practices of EDA help find the stories that need to be told from data sets. As you're discovering, structuring, and cleaning in your career, I hope that you are digging through the data with determination, gathering your major finds together, questioning your perspective, and researching more about your discoveries. Then, I hope you remember the other three practices of joining, validating, and presenting to complete your EDA work. In this course, you've had the chance to explore how data professionals take care of the stories that have plot holes or puzzling scenes, that is, missing data and outliers. You also learned how to change categorical data into numerical data using the label encoding technique. And finally, you considered how to design visualizations and present your data in really impactful ways. You learned the advanced concepts of visualizing and you started using Tableau. Throughout these lessons, you learned some workplace skills like communicating to different audiences, the importance of ethics, the need for accessibility, and the importance of following the PACE workflow. These are the types of skills that will serve you throughout your career, from entry-level data professional to senior data professional and beyond. In the upcoming courses, you'll be learning some integral concepts in statistics, regression, and machine learning. The knowledge you've gained in our course will be foundational to your progression through these next courses as well as through your career as a professional in data analytics. It has been my pleasure instructing you. The practices of EDA and data visualization are close to my heart and I'm always excited to meet future data professionals learning these principles. Great job on completing this course. You have the makings of a solid storytelling data professional. May you always find excitement in exploring and telling stories using data.

Hey there. Welcome to the next stage of your journey to learn advanced data analytics. First, congratulations on your progress. You've learned how data professionals contribute to the success of an organization and the main tools and techniques they use on the job. You're now familiar with the basic syntax and functions of the Python programming language, and you know how to use code for exploratory data analysis. You can use data wrangling to organize and clean your data and create data visualizations to share important information. Well done. You already have quite a few tools in your analytics toolbox. Have room for a few more? Next up: statistics. Statistics is the study of the collection, analysis, and interpretation of data. Statistics, often abbreviated as stats, provides data professionals with powerful tools and methods for transforming data into useful knowledge. You've already learned about exploratory data analysis and how it helps you summarize the main characteristics of your data. Descriptive statistics does this too, and that's where we'll start. But data professionals also use statistics to do something more.
Based on a small sample of available data, they can make informed predictions about uncertain events and make accurate estimates about unknown values. This is known as inferential statistics, and you'll learn all about it in this course. For example, data professionals use statistics to predict future sales revenue, the success of a new ad campaign, the rate of return on a financial investment, or the number of downloads for a new app. Statistical analysis can tell you which version of a website will attract more new customers for longer periods of time, or that new users will typically create an account after spending three minutes on the company's website. The insights gained from statistical analysis help business leaders make decisions, solve complex problems, and improve the performance of their products and services. This is why data professionals are in such high demand and why the data career space keeps growing.

Speaking of data professionals, allow me to introduce myself. My name is Evan. I'm an economist and I consult with various teams across Google. This means that I use statistics and other tools to analyze and interpret data to help business leaders make informed decisions. This includes helping them quantify uncertainty and identify whether there is sufficient evidence to reject a hypothesis, both of which you'll learn more about later on. I'm thrilled to be your instructor in this course. Before we begin, let me tell you about my own experience with statistics. As an undergraduate, I majored in economics and mathematics, and then went on to get a PhD in economics. I focused on statistics and econometrics, a branch of economics that uses statistics to analyze economic problems. During my graduate studies, I interned at an online learning company and was a researcher at an online retail company. Across these roles and experiences, I've used many different statistical tools to solve problems. Often, I find that the problem I'm working on can be solved with a statistical method I'm unfamiliar with. I love constantly learning new methods and extending the range of problems I can work on. These advanced methods are built on a foundation of stats concepts. In this course, we'll focus on these fundamentals to prepare you for your future career. So if you're new to stats, welcome. This course does not assume you have any prior knowledge of statistics. We'll begin from the beginning and work through each concept step by step. But if you have some experience in statistics, that's great, too. We'll help you use what you already know in a new way so you can apply your stats knowledge to data analytics specifically.

In this course, you'll discover how data professionals use statistical tools in their daily work. You'll also learn strategies for interpreting findings and sharing them with stakeholders who may not be familiar with stats concepts or all the technical details. We'll start this course with an introduction to the role of stats in data analytics, and we'll discuss the differences between descriptive and inferential stats. You'll learn how descriptive stats, such as mean, median, and standard deviation, help you quickly summarize and better understand your data. Then we'll explore how to use inferential statistics to draw conclusions and make predictions about data. Next, we'll explore probability and discover useful ways to measure uncertainty.
We'll discuss the basic rules of probability and how to interpret different types of probability distributions, such as the normal, binomial, and Poisson distributions. From there, we'll move on to sampling. We'll discuss what makes a good sample, the benefits and drawbacks of different sampling methods, and how to work with sampling distributions. We'll also examine confidence intervals, which describe the uncertainty in an estimate. You'll learn how to construct different kinds of confidence intervals and interpret their meaning. After that, we'll explore how to use hypothesis testing to compare and evaluate competing claims about your data. We'll go over the steps for applying different tests to specific data sets, and we'll demonstrate how to interpret test results. Finally, you'll get a chance to apply your stats knowledge in your next portfolio project. The portfolio project features a scenario based on A-B testing, an important practical application of statistics. In future job interviews, you can share your project as a demonstration of your skills and impress potential employers. I'll be your guide every step of the way. And remember, you set the pace. Feel free to go over the videos as many times as you like and review topics that are new to you. By the end of the course, you'll have a useful toolkit of stats concepts to carry with you on the rest of your learning journey and in your future career. Let's get started.

Hi there, I can't wait to explore statistics with you. During our journey together, you'll learn how data professionals use statistics to gain insights from data and help organizations solve complex problems. Statistics is the study of the collection, analysis, and interpretation of data. We'll start off by discussing the foundational role of statistics in data-driven work and the importance of fundamental stats concepts. Next, you'll have an opportunity to observe stats in action. We'll explore an example of how data professionals use statistical methods to conduct an A-B test. Then we'll discuss the two main types of statistics, descriptive and inferential. Data professionals use descriptive stats to explore and summarize data. They use inferential stats to draw conclusions and make predictions about data. Next, we'll consider three different types of descriptive statistics and how they can help you better understand different aspects of data. Measures of central tendency, such as the mean, allow you to describe the center of a data set. Measures of dispersion, like standard deviation, let you describe the spread of data. Measures of position, such as percentiles, help you determine the relative position of the values in a data set. Finally, you'll learn how to use Python to compute descriptive statistics and summarize a data set. When you're ready, join me in the next video.

Earlier, you learned that statistics is the study of the collection, analysis, and interpretation of data. Today, humans generate and collect more data than ever before. Whenever we send a text message, make a purchase online, or post a photo on social media, we generate new data. As the amount of data grows, so does the need to analyze and interpret it. This is a big reason why stats are so important in data-driven work, and why the field of data analytics is growing almost as fast as the data itself. Data professionals use statistics to analyze data in business, medicine, science, engineering, government, and more.
In this video, we'll discuss the role of statistics in data science and why learning fundamental stats concepts is essential for every data professional. Data professionals use the power of statistical methods to identify meaningful patterns and relationships in data, analyze and quantify uncertainty, generate insights from data, make informed predictions about the future, and solve complex problems. Even if you've never studied statistics, you probably use stats daily. For example, you may start your day by going online and checking the weather, where you learn that the forecast calls for a 70% chance of rain or a 50% chance of snow. Perhaps you visit a sports website to learn the batting average of your favorite cricket player, or the scoring average of your favorite basketball player. On a news app, you might come across an election poll that reports a 3% margin of error and notes that an online survey was used to collect the data. Or, perhaps you're a parent, and when you take your child to their yearly checkup, you learn that your child is in a certain percentile for height and weight. When you ask for more information, the doctor shows you the median height and weight for all kids who are the same age. These scenarios include statistical concepts that you'll learn more about in this course. The weather report is based on probability, or the likelihood of an event. The sports stats express average value. The election poll shows margin of error. The doctor uses the concepts of percentile and median, and all these stats give you useful knowledge that you can apply to your own life. Data professionals use the same concepts in their work. For example, a data professional might use probability to predict the future rate of return on an investment. They might estimate the annual average sales revenue for a company, calculate the margin of error to quantify the uncertainty of an employee satisfaction survey, or use percentiles to rank median home prices in different cities. On the job, data professionals use stats to transform data into insights that help stakeholders make decisions. Statistics is the foundation of data analytics and is the basis for the most advanced methods of analysis that data professionals use. And it all begins with the fundamental concepts that we're exploring in this course. Consider the role grammar plays in your conversations. For example, when you chat with friends or coworkers, you're probably not thinking about grammatical concepts like the parts of speech. If you're having a conversation, then you already know how to use nouns, verbs, and adjectives. Knowledge of basic grammar makes it possible to use language in the first place. This is why it's so foundational. In a similar way, shared knowledge of basic statistics allows data professionals to use a common language. Learning the basics will eventually let you join a conversation about more advanced topics. You'll build on your foundation in statistics with more complex methods like hypothesis testing, classification, regression, and time series analysis. I hope you're getting excited about how data professionals use stats to make sense of their data and gain useful knowledge. Coming up, you'll get a chance to check out an example of stats in action.

Today's economy is all about data. Business leaders want to make data-driven decisions based on evidence and analysis. Companies that use insights gained from data to guide their decision-making process are more likely to be successful than companies that don't.
And data professionals are the people who generate those insights. They use statistics to transform data into knowledge and help stakeholders make informed decisions. All the fundamental stats concepts that we cover in this course have valuable practical applications. In this video, you'll get a chance to check out stats in action. We'll go over one of the most popular applications of statistics for business, A-B testing. I'll discuss how the stats concepts you'll learn in this course can help you analyze and interpret data using an A-B test. Companies use A-B testing to evaluate everything from website design to mobile apps to online ads to marketing emails. A-B testing is a way to compare two versions of something to find out which version performs better. A-B testing has become popular because it works well for many online applications. For example, businesses often use A-B testing to compare two versions of a web page to find out which one gets more clicks, purchases, or subscriptions. Even small changes to a web page, like changing the color, size, or location of a button, can increase financial gains. A-B tests help business leaders optimize product performance and improve customer experience. Another way companies use A-B testing is for marketing emails. You might send two versions of an email to your customer list to find out which version results in more sales. Or you might test two versions of an online ad to discover which one visitors click on more often. Once you've conducted the A-B test, you can use the data to make permanent changes to your ad.

Let's go through an example of an A-B test step by step. Imagine you run an online store and 10% of visitors to your website make a purchase. You want to run an A-B test to find out if changing the size of the add to cart button will increase the conversion rate, or the percentage of customers who purchase a product. The test presents two versions of your webpage, known as version A and version B, to a group of randomly selected users. Version A is the original webpage. Version B is the webpage with the larger add to cart button. The test directs half the users to version A and half to version B. The test runs for two weeks. When the test is over, a statistical analysis of the results indicates that the larger button in version B resulted in an increase in purchases. The conversion rate for version B is 30%. This is three times greater than the conversion rate for version A, which is 10%. That's a notable increase. Because of your A-B test, your company has a data-driven reason for replacing the current webpage with version B and increasing the size of the add to cart button. Now that you know how an A-B test works, let's explore the stats concepts behind A-B testing. Later on, we'll cover each concept in more detail. Think of this list as a brief preview of your future stats knowledge. The A-B test analyzes a small group of users drawn from the total population of all users that visit the website. In stats, we call this smaller group the sample. The sample is a subset of the larger population. You can use data from a sample to make inferences or draw conclusions about the entire population. Data professionals use inferential statistics to make inferences about a dataset based on a sample of the data. In other words, stats are a powerful tool for predicting outcomes you don't know using data you do know. For example, you have no way of knowing how the next 100,000 website visitors will behave.
What you can do is observe the next 1,000 visitors and then use inferential statistics to predict how the following 99,000 will behave. And as you'll discover, stats can help you make that prediction with accuracy. This is why observing a sample through A-B testing can be so valuable to companies. They can use the results of the test to make changes that improve their business. Sampling, the process of selecting a subset of data from a population, is a critical part of an A-B test. Before you conduct the test, you need to decide on the sample size, or the number of users in the test. Choosing the right sample size helps you get valid test results and avoid statistical errors. For example, you'll use stats to help you determine whether you need to use a sample size of 1,000 or 10,000 in order to predict customer behavior accurately. Like any statistical test, an A-B test can't predict user behavior with 100% certainty. What stats can do is construct a confidence interval, or a range of values that describes the uncertainty surrounding an estimate. Knowing how to construct and interpret a confidence interval helps you make informed decisions about all users based on your test sample. Using stats, you can quantify the uncertainty of your A-B test and share this information with stakeholders to help them interpret the results. We'll talk more about how to interpret a confidence interval later on. After the test is complete, you'll need to determine the statistical significance of your results. Statistical significance refers to the claim that the results of a test or experiment are not explainable by chance alone. For instance, is the difference between version A and version B due to random chance or due to the fact that you changed the add to cart button? A hypothesis test is a statistical method that helps you answer this question. The test helps quantify whether the result is likely due to chance or if it's statistically significant. A hypothesis test gives you data-driven support for changing your web page to version B or for keeping it the same with version A. Software can help you calculate complex math problems, but having a working knowledge of stats lets you properly design, conduct, and interpret the results of a real test. By the end of this course, you'll know how to use all the stats concepts we just reviewed to analyze and interpret data. In fact, you'll be able to put your stats skills to work in a portfolio project based on a realistic A-B testing scenario. Plus, your knowledge of stats will serve as a foundation for more advanced data analytics methods that you'll explore later on.

Now that you know more about the role of statistics in data science, let's discuss the two main types of statistical methods: descriptive and inferential. Data professionals use each method to get different insights from their data. In this video, you'll learn the difference between descriptive and inferential stats and how data professionals use both to better understand their data. Descriptive statistics describe or summarize the main features of a data set. Descriptive stats are very useful because they let you quickly understand a large amount of data. For example, let's say you had data on the heights of 10 million people. You probably don't want to scan 10 million rows of data to analyze it for your report. Even if you did, it would be difficult to interpret the data. However, if you summarize the data, you can instantly make it meaningful.
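To make that idea concrete, here's a minimal sketch in Python. The heights are simulated with NumPy purely for illustration, but the summary calls would be the same for real data:

```python
import numpy as np

# Simulated stand-in for real data: 10 million heights, in centimeters
rng = np.random.default_rng(seed=42)
heights = rng.normal(loc=170, scale=10, size=10_000_000)

# Three lines summarize what no one could read row by row
print(f"Mean height:   {heights.mean():.1f} cm")
print(f"Median height: {np.median(heights):.1f} cm")
print(f"Std deviation: {heights.std(ddof=1):.1f} cm")
```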
Finding the mean or average height gives you useful knowledge about the data. Plus, reading a summary is much better than staring at millions of rows of data. There are two common forms of descriptive statistics: visuals, like graphs and tables, and summary stats. Previously, you learned how graphs and tables can help you explore, visualize, and share your data. You're likely familiar with data visualizations such as histograms, scatter plots, and box plots. Summary statistics let you summarize your data using a single number. A common example is the mean or average value. There are two main types of summary stats, measures of central tendency and measures of dispersion. Measures of central tendency, like the mean, let you describe the center of your data set. Measures of dispersion, like standard deviation, let you describe the spread of your data set or the amount of variation in your data points. Stats like mean and standard deviation are used to describe and summarize data.

But data professionals do more than just describe their data. They also draw conclusions and make predictions based on data. For this, they use inferential statistics. Inferential statistics allow data professionals to make inferences about a data set based on a sample of the data. The data set that the sample is drawn from is called the population. The population includes every possible element that you are interested in measuring. And as we've discussed, a sample is a subset of a population. Data professionals use samples to make inferences about populations. In other words, they use the data they collect from a small part of the population to draw conclusions about the population as a whole. Note that a statistical population may refer to people, objects, or events. For instance, a population might be the set of all residents of a country, all the planets in our solar system, or all the outcomes of 1,000 coin flips. A sample is a smaller group or subset of any of these populations. Samples might be residents, planets, or coin flip outcomes. Let's check out an example. Say you want to research the music preferences of every college student in the United States to find out whether they prefer pop, rap, country, classical, or another genre of music. There are around 20 million college students in the United States, and it would be too expensive and time-consuming to gather data from every single person. Instead, you can use a sample and survey only a subset of the 20 million students. Later on, we'll discuss the factors that go into choosing different sample sizes and how larger sample sizes affect your results. For now, let's imagine you decide to survey 1,000 students instead of 20 million. Then you can use the results to make inferences about the music preferences of all college students. Keep in mind that your sample should be representative of your population. Otherwise, the conclusions you draw from your sample will be unreliable and possibly biased. A representative sample is a sample that accurately reflects the population. For example, if you only survey math majors or only student-athletes, your sample will not be representative of all college students. Finally, let's review two terms that correspond to population and sample: parameter and statistic. A parameter is a characteristic of a population. A statistic is a characteristic of a sample. For example, the average height of the entire population of giraffes is a parameter. The average height of a random sample of 100 giraffes is a statistic.
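Here's a small simulated sketch of that distinction. The giraffe heights are invented for illustration, but it shows a sample statistic standing in for a population parameter:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# An invented "population": heights of 100,000 giraffes, in meters
population = rng.normal(loc=4.8, scale=0.4, size=100_000)

# The parameter: a characteristic of the entire population
parameter = population.mean()

# The statistic: the same characteristic, computed on a random sample of 100
sample = rng.choice(population, size=100, replace=False)
statistic = sample.mean()

print(f"Population parameter (mean height): {parameter:.2f} m")
print(f"Sample statistic (mean of 100):     {statistic:.2f} m")
```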
As I mentioned, it's difficult to collect data about every member of a large population. In this case, to locate and measure the height of every single giraffe in the world. So we use the known value of a sample statistic, the average of 100 giraffes, to estimate the unknown value of a population parameter. That's all for now. We've covered a lot of key concepts that are foundational for what you'll learn later on in the course. Coming up, we'll return to the topic of inferential statistics. We'll explore sampling in more detail and check out common methods of inferential stats such as confidence intervals and hypothesis testing.

Every time I explore a new dataset, I feel a sense of excitement. It's like exploring a city for the first time. When I visit a new city, I often start my journey at the city's center. This way, I can figure out the distance between the center and the city limits or to a famous landmark I want to visit. Knowing where I am in relation to the center helps me find my way around. It's the same when I want to learn about a new dataset. First, I want to know the center of my dataset. Then I want to know how spread out the other values are from the center. Measuring the center and the spread of a dataset helps me quickly understand its overall structure and decide which parts I want to explore in more detail. Earlier, you learned that summary statistics include measures of central tendency and measures of dispersion. Measures of central tendency are values that represent the center of a dataset. Measures of dispersion are values that represent the spread of a dataset. We'll talk more about measures of dispersion later on. In this video, you'll learn how to calculate three measures of central tendency: the mean, the median, and the mode. You may remember these terms from earlier in the program, but we'll discuss their importance in the study of statistics and data analysis. We'll also discuss which measure is best to use based on your specific data. Let's start with the mean. The mean is the average value in a dataset. To calculate the mean, you add up all the values in your dataset and divide by the total number of values. For example, say you have the following set of values: 10, eight, five, seven, and 70. To find the mean, you add up all the values for a total of 100. Then you divide by five, the total number of values. The mean or average value is 20. Next, the median. The median is the middle value in a dataset. This means half the values in the dataset are larger than the median and half are smaller. You can find the median by arranging all the values in a dataset from smallest to largest. If you arrange your five values in this way, you get five, seven, eight, 10, and 70. The median or middle value is eight. If there are an even number of values in your dataset, the median is the average of the two middle values. Let's say you add another value, four, to your set. Now, the two middle values are seven and eight. The median is their average, 7.5. You may notice that the mean, 20, is much greater than the median, eight. This is because there's one extreme value, 70, that increases the overall average. This value is known as an outlier. Recall that an outlier is a value that differs greatly from the rest of the data. As measures of central tendency, the mean and the median work better for different kinds of data. If there are outliers in your dataset, the median is usually a better measure of the center. If there are no outliers, the mean usually works well.
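You can check those calculations with Python's built-in statistics module; a quick sketch using the same values:

```python
import statistics

values = [10, 8, 5, 7, 70]

print(statistics.mean(values))    # 20, pulled upward by the outlier, 70
print(statistics.median(values))  # 8, the middle of the sorted values

# With a sixth value, the median is the average of the two middle values
values.append(4)
print(statistics.median(values))  # 7.5, the average of 7 and 8
```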
For example, imagine you want to buy a home in a specific neighborhood. You tour 10 homes in the area to get an idea of the average price. The first nine homes have a price of $100,000. The 10th home has a price of a million dollars. This is an outlier that pulls up the average. If you add all the home prices and divide by 10, you find that the mean or average price is $190,000. The mean doesn't give you a good measure of the typical value of a home in this neighborhood. In fact, only one home out of 10 costs more than $100,000. The median home price is $100,000. The median gives you a much better idea of the typical value of a home in this neighborhood. Whether you use the mean or median depends on the specific dataset you're working with and what insights you want to gain from your data. Finally, we have the mode. The mode is the most frequently occurring value in a dataset. A dataset can have no mode, one mode, or more than one mode. For example, the set of numbers one, two, three, four, five has no mode because no value repeats. In the set one, three, three, five, seven, the mode is three because three is the only value that occurs more than once. The set one, two, two, four, four has two modes: two and four. The mode is useful when working with categorical data because it clearly shows you which category occurs most frequently. Say an online retail company conducts a survey. Customers rate their experience as bad, mediocre, good, or great. A bar chart summarizes the results. The highest bar refers to the rating bad. This is the most frequently occurring value or mode. The mode gives the company clear feedback on customer satisfaction. To recap, the mean, median, and mode all measure the center of a dataset in different ways. The mean finds the average value, the median finds the middle value, and the mode finds the most frequently occurring value. Knowing the center of your dataset helps you quickly understand its basic structure and determine the next steps in your analysis. Just like knowing the city center helps orient you in a new environment.

As I mentioned earlier, there are two things I want to know when I begin to explore a new dataset. First, the location of the center, or the measures of central tendency, like the mean. Second, I want to know how spread out the values are from the center, or the measures of dispersion, like the standard deviation. To get a complete picture of the data, it's good to know both the center and the spread. For example, datasets with the same central value can have different levels of variability or spread. Take three small datasets. Each set has three values that add up to 90. Each set has the same mean: 90 divided by three equals 30. But the variation of values around the mean is much different. In the first set, the values 25, 30, and 35 are all close to the mean of 30. In the third set, the values 5, 10, and 75 are much more spread out from the mean. Earlier, you learned how measures of central tendency, like mean, median, and mode, represent the center of your dataset. Now, you'll learn how measures of dispersion, such as the range and standard deviation, can help you understand the spread of your data. The range is the difference between the largest and smallest value in a dataset. Say you have data on the daily temperature in Fahrenheit for the city of Central Valley, Costa Rica, for the past week. The highest temperature is 77 degrees. The lowest temperature is 67 degrees. So the range is 10.
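In code, the range is a one-line calculation. Here's a sketch with invented daily highs that match the highest and lowest values above:

```python
# Hypothetical daily highs in Fahrenheit for the past week
temps = [72, 77, 70, 67, 74, 76, 71]

range_temps = max(temps) - min(temps)
print(range_temps)  # 10, the highest value (77) minus the lowest (67)
```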
The range is a useful metric because it's easy to calculate and gives you a very quick understanding of the overall spread of your dataset. However, standard deviation gives you a more nuanced idea of the variation in your data. Standard deviation measures how spread out your values are from the mean of your dataset. It calculates the typical distance of a data point from the mean. The larger the standard deviation, the more spread out your values are from the mean. Another measure of spread is called the variance, which is the average of the squared difference of each data point from the mean. Basically, it's the square of the standard deviation. You'll learn more about variance and how to use it in a later course.

Let's check out the plots of three normal probability distributions to get a better idea of spread. Later on, you'll learn more about distributions, which map all the values in a dataset. For now, just know that the mean is the highest point on each curve, right in the center. Each curve has a different standard deviation. Blue is one, green is two, and red is three. The blue curve has the least spread since most of its data points fall close to the mean. Therefore, the blue curve has the smallest standard deviation. The red curve has the largest standard deviation. It has the most spread since most of its data points fall farther away from the mean. Now, let's talk about how you calculate these numbers. Here's the formula for the standard deviation of a sample: the square root of the sum of (x minus x-bar) squared, divided by n minus one. If you're new to stats, this formula may seem like a secret code or an unfamiliar language. That's okay. We'll go over the variables and formula step by step. Plus, you don't need to memorize the formula or do all the math on your own. As a data professional, you'll typically use a computer for calculations. Being able to perform calculations by hand is useful, but being familiar with the concepts behind the calculations is what will help you apply statistical methods to workplace problems. There are different formulas to calculate the standard deviation for a population and for a sample. As a reminder, data professionals typically work with sample data and they make inferences about populations based on the sample. So we'll review the formula for a sample. Let's consider how to interpret the formula by calculating the standard deviation of a small data set: eight, 10, and 12. Calculating the standard deviation involves five steps. First, find the mean of the data set. The mean equals 10. Next, for each value x in our data set, we find the distance to the mean and then we square it. We'll include that calculation in our next step. The Greek letter sigma is a symbol that means sum. So we need to do the x minus 10 squared calculation for each data point and add up all the results. Eight minus 10, squared, is four. 10 minus 10, squared, is zero. And 12 minus 10, squared, is four. Add all those together to equal eight. Then divide that total by n minus one. n refers to the total number of values in your data set, which is three. So three minus one equals two. Then our sum of eight divided by two equals four. Finally, take the square root of four. That's two, the standard deviation. Now let's explore an example of how standard deviation is useful in everyday life.
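Before that example, here's a minimal sketch that walks through the same five steps in Python for the data set eight, 10, and 12, then checks the result against the built-in statistics.stdev function:

```python
import statistics

data = [8, 10, 12]

# Step 1: find the mean
mean = sum(data) / len(data)  # 10.0

# Steps 2 and 3: square each value's distance from the mean, then sum (the sigma)
total = sum((x - mean) ** 2 for x in data)  # 4 + 0 + 4 = 8.0

# Step 4: divide by n minus one (the sample formula)
variance = total / (len(data) - 1)  # 8.0 / 2 = 4.0

# Step 5: take the square root
std_dev = variance ** 0.5  # 2.0

print(std_dev)                 # 2.0
print(statistics.stdev(data))  # 2.0, the built-in sample standard deviation
```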
Meteorologists use standard deviation for weather forecasting to understand how much variation exists in daily temperatures in different places and to make more accurate predictions about the weather. Imagine two meteorologists working in two different cities, city A and city B. During the month of March, city A has a mean temperature of 66 degrees Fahrenheit and a standard deviation of three degrees. City B has a mean temperature of 64 degrees Fahrenheit and a standard deviation of 16 degrees. Both cities have similar mean temperatures. In other words, the overall average temperature is about the same. But the standard deviation is much higher in city B. This means there's more daily variation in temperature in city B. So the weather may change dramatically from day to day there. In city A, the weather's more consistent. If the meteorologists in city B predicted the weather based on the mean, they could be off by 16 degrees, which would lead to a lot of unhappy residents. The standard deviation gives the meteorologist a useful measure of variation to consider and a level of confidence about their prediction. A low standard deviation in temperature makes it a lot easier for the meteorologist in city A to accurately predict the daily weather. Data professionals use standard deviation to measure variation in many types of data, like ad revenues, stock prices, employee salaries, and more. Now you have a better idea of how standard deviation measures the spread of your data. Coming up, we'll discuss some ways of understanding the relative position of the values in a dataset.

By now, you've learned different ways to describe the center of your data with measures of central tendency, such as the mean and median. You can also use measures of dispersion, such as the standard deviation, to represent the spread of your data. These tools will help you explore and better understand any dataset you may encounter. Now we'll finish our tour of descriptive statistics by checking out measures of position. Measures of position help you determine the position of a value in relation to other values in a dataset. Along with center and spread, it's helpful to know the relative position of the values. For example, whether one value is higher or lower than another, or whether a value falls in the lower, middle, or upper portion of your dataset. In a city, this is similar to knowing where different places of interest are located in relation to one another. For example, it's useful to know how far the art museum is from the city park, or if the famous restaurant you want to eat at is close to the historical monument you want to visit. In this video, you'll learn about the most common measures of position: percentiles and quartiles. You'll also learn how to calculate the interquartile range and use the five-number summary to summarize your data. A percentile is the value below which a percentage of data falls. Percentiles show the relative position or rank of a particular value in a dataset. Some universities ask applicants to take standardized tests. For example, in the United States, the SAT and the ACT are common exams. When a student receives their test score, they usually also receive a corresponding percentile. For example, let's say a test score falls in the 99th percentile. This means the score is higher than 99% of all test scores. If a score falls in the 75th percentile, the score is higher than 75% of all test scores.
If a score falls in the 50th percentile, the score is higher than half, or 50%, of all test scores, and so on. Percentiles are useful for comparing values. For example, different exams may have different scoring systems. SAT scores range from 400 to 1600. ACT scores range from one to 36. And a typical school exam in math or history may range from zero to 100. If you only know the raw scores for each test, say a thousand for the SAT, 20 for the ACT, and 70 for the school exam, you have no way of making a meaningful comparison. If you know that all three test scores fall in the 50th percentile, then you can meaningfully compare student performance across the different exams. You can use quartiles to get a general understanding of the relative position of values. A quartile divides the values in a data set into four equal parts. Quartiles let you compare values relative to the four quarters of data. Each quarter includes 25% of the values in your data set. The first quartile is the middle value in the first half of the data set. The first quartile, Q1, is also called the lower quartile. 25% of the data points are below Q1, and 75% are above it. The second quartile is the median value in the set. Q2 is the median. 50% of the data points are below Q2, and 50% are above it. The third quartile is the middle value in the second half of the data set. 75% of the data points are below Q3, and 25% are above it. Note the relationship between quartiles and percentiles. Q1 refers to the 25th percentile, Q2 to the 50th, and Q3 to the 75th.

For example, say you're the manager of a sports team. You have data that shows how many goals each player on your team scored over the course of an entire season. You want to compare the performance of each player based on scoring. You can calculate quartiles for your data using these steps. First, arrange the values from smallest to largest: 11, 12, 14, 18, 22, 23, 27, 33. Second, find the median of your data set. This is the second quartile, Q2. There are an even number of values in the data set, so the median is the average of the two middle values, 18 and 22. Q2 equals 20. Third, find the median of the lower half of your data set. This is the lower quartile, Q1. Q1 equals 13. Finally, find the median of the upper half of your data set. This is the upper quartile, Q3. Q3 equals 25. Breaking the data into quartiles gives you a clear idea of player performance. You now know that the lower quartile of players scored 13 goals or less, and the upper quartile scored 25 goals or more. In other words, the lower 25% of players scored 13 goals or less, and the upper 25% scored 25 goals or more. The middle 50% of players scored between 13 and 25 goals. The middle 50% of your data is called the interquartile range, or IQR. The interquartile range is the distance between the first quartile, Q1, and the third quartile, Q3. Technically, IQR is a measure of dispersion because it measures the spread of the middle half, or middle 50%, of your data. This is the same as the distance between the 25th and 75th percentiles, or between Q1 and Q3. IQR is also useful for determining the relative position of your data values. IQR equals Q3 minus Q1. In this case, Q3 equals 25 and Q1 equals 13, so IQR equals 12. Finally, you can summarize the measures of position in your data set with the five-number summary. The five numbers include the minimum, the first quartile, the median, or second quartile, the third quartile, and the maximum.
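Here's a sketch that computes those quartiles in Python using the median-of-halves method described above. Note that library functions like NumPy's percentile interpolate differently by default, so they may return slightly different quartile values:

```python
from statistics import median

goals = sorted([11, 12, 14, 18, 22, 23, 27, 33])
n = len(goals)

# Q2 is the median of the whole data set
q2 = median(goals)  # 20.0

# Q1 and Q3 are the medians of the lower and upper halves
# (with an even number of values, the halves split cleanly in the middle)
q1 = median(goals[: n // 2])  # 13.0
q3 = median(goals[n // 2 :])  # 25.0

iqr = q3 - q1  # 12.0
five_number_summary = (min(goals), q1, q2, q3, max(goals))
print(five_number_summary)  # (11, 13.0, 20.0, 25.0, 33)
```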
For your sports data, the five-number summary is 11, 13, 20, 25, and 33. The five-number summary is useful because it gives you an overall idea of the distribution of your data, from the extreme values to the center. You can visualize it with a box plot. The box part of the box plot goes from the first quartile to the third quartile. The vertical line in the middle of the box is the median. The horizontal lines on each side of the box, known as whiskers, go from the first quartile to the minimum value, and from the third quartile to the maximum value. The following box plot shows the data on goals. We can find the values on the box plot and determine the interquartile range. Q1, or the lower quartile, equals 13. Q3, or the upper quartile, equals 25. The interquartile range is the length of the box. 25 minus 13 equals 12. Data professionals use measures of position, such as percentiles and quartiles, to better understand all kinds of data. This may include public health data, such as life expectancy, economic data, such as household income, business data, such as product sales, and more. That concludes our tour of descriptive statistics. Coming up, you'll use Python to compute descriptive stats and summarize a dataset.

I'm Alok. I'm a data science developer advocate at Google Cloud. What we do generally is talk to developers about how to use Google Cloud. The way I think about statistics is that it's a combination of mathematics and applying it to data. It's important for data professionals to learn about statistics because I think it gives you a good foundation for the math behind the techniques you're going to apply, and a breadth of information for how you might apply this math to different problems. In my first job at Google, where I was a data scientist on the search ads team, we used statistical methods to generate insights that informed decisions all the time. In fact, it was essentially the core of the job. One time where statistics helped influence decision makers from my experience was a project where we had two groups, let's say A and B. And we were seeing some big differences in the behavior of these two groups in a particular metric. The executives were kind of worried about why they were different; maybe they shouldn't be so different. And statistical methods were crucial to this. We had to adjust for things like mix effects, looking at different slices of our data, adding confidence intervals around the mean difference we were seeing. What we found was that the difference was not as large as it had first appeared, and it could be attributed to these other things, like the mix among our users. Then when we presented this to the executives, they were sort of relieved that the differences were not as big and there wasn't necessarily something to do to change the dynamic between group A and B. It was within the reasonable difference that they felt was okay. The idea is essentially that statistics will give you a set of tools. And in this case, it gave me a set of tools to decompose this problem into various parts and start to explain why we were seeing the differences we were seeing. The value of completing a program like this one is that it gives you the foundation to do a lot of great work in data science and data analytics. Having done some coursework in data, as well as having some project experiences, sets you up well to be able to analyze data and make an impact in whatever industry you end up working in.
The best thing I can tell those who are kind of in the middle of this and maybe struggling through is: keep your end goal in mind, right? Whether it's learning a new skill or opening up a whole new career path, that can really be a differentiator for you. Take it step by step; if you fall a little bit behind, forgive yourself and just keep that end goal in mind. That's where you wanna get.

Recently, you learned how descriptive statistics help you explore and summarize key features of your data. We talked about three different types of descriptive stats. Measures of central tendency, like the mean, refer to the center of your data set. Measures of dispersion, like the standard deviation, describe the spread of your data set. Measures of position, like percentiles and quartiles, show the relative location of your data values. Earlier in the program, you learned about the process of exploratory data analysis, or EDA, from discovering to presenting your data. Whenever a data professional works with a new data set, the first step is to understand the context of the data during the discovery stage. Often, this is done by discussing the data with project stakeholders and reading documentation about the data set and the data collection process. After that, a data professional moves on to data cleaning and deals with issues like missing data and incorrect values. Computing descriptive stats is a common step to take after data cleaning. Now, you'll learn how to use Python to compute descriptive statistics and summarize a data set. The great thing about using Python for stats is that it does all the difficult work for you and takes care of all the complex calculations. With a single line of code and the push of a button, you have the mean, median, standard deviation, and more. Using Python is like having a friendly math genius working right alongside you, a genius who never gets tired or distracted or needs a coffee break. Also, you're well-prepared to use Python for stats. You've already learned how to use Python to organize, clean, and visualize your data. In this course, we'll introduce some new functions specific to stats, but we'll continue to use the same syntax and coding concepts you're familiar with. Now, let's explore our example. Imagine you're a data professional working for the government of a large nation. The government's Department of Education is seeking to understand the current literacy rates across the country. The literacy rate is defined as the percentage of the population of a given age group that can read and write. You're asked to analyze the data about the literacy rate among primary and secondary school students. These are students who range in age from six to 18 years old. There's data available for every state and district in the country. You'll use descriptive stats to get a basic understanding of the literacy rate data for each district. So, let's open a Jupyter Notebook and get started. To start, import the Python packages you will use, NumPy and Pandas, and the module you will use, matplotlib.pyplot. You worked with NumPy, Pandas, and matplotlib.pyplot when you learned about EDA. They'll also help you compute descriptive stats. To save time, rename each package and module with an abbreviation: np, pd, and plt. We've provided a file to download so you can follow along. For this example, we'll name our data education_districtwise. This tells us that we're dealing with education data organized by district.
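That setup might look like the following sketch, assuming the file is a CSV named education_districtwise.csv; the exact file name and format may differ in your copy of the materials:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the district-level education data into a pandas DataFrame
education_districtwise = pd.read_csv('education_districtwise.csv')
```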
As a best practice, choose names that clearly state the main content or purpose of the data. This way, you can easily access and remember your data in the future. Start with the head function to get a quick overview of the first 10 rows of your dataset. Recall that head will return as many rows of data as you specify as its argument. Before you start computing descriptive stats, review the contents of your dataset to understand what the column headers mean. Your dataset has seven columns and 680 rows. The first five columns refer to different administrative units: district name, state name, blocks, villages, and clusters. In other words, this is the system the nation uses for organizing its population into different sized units or sections. The nation is divided into states. States are further divided into districts. A large state may have more than 40 districts. A small state may have fewer than four districts. Each district is further divided into blocks, clusters, and villages. The total population column, abbreviated TOTPOPULAT, refers to population. The overall literacy rate column, abbreviated OVERALL_LI, refers to literacy rate. To interpret this data correctly, it's important to understand that each row or observation refers to a different district, and not, for example, to a state or a village. So, the village column shows how many villages are in each district. The total population column shows the population for each district. The overall literacy column shows the literacy rate for each district.

Now that you have a better understanding of your dataset, use Python to compute descriptive stats. When computing descriptive stats in Python, the most useful function to know is describe. Data professionals use the describe function as a convenient way to calculate many key stats all at once. If you use this function for a column with numeric data, you get a count of all the observations in the column, along with the following stats: the mean, or average value; the median, or middle value; the standard deviation, which measures the spread of the data; the minimum and maximum values; and the first and third quartiles. Your main interest is the literacy rate. This data is contained in the overall literacy column, which shows the literacy rate for each district in the nation. Use the describe function to show key stats about literacy rate. The output lists key stats for all districts, and the count category confirms there are 634 districts in your dataset. Note that the number of observations for the overall literacy column is 634, but the number of rows in the dataset is 680. This is because the describe function does not include missing values. The summary of stats gives you valuable information about the overall literacy rate. For example, the mean helps to clarify the center of your dataset. You now know the average literacy rate is about 73% for all districts. This information is useful in itself, and also as a basis for comparison. Knowing the mean literacy rate for all districts helps you understand which individual districts are significantly above or below the mean. This will help the department decide how to devote resources to improving literacy. Note that the categories 25%, 50%, and 75% refer to Q1, Q2, and Q3, respectively. Remember that Q2 is also the median of your dataset. You can also use the describe function for a column with categorical data, like the state name column.
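Both describe calls are one-liners. Here's a sketch using the OVERALL_LI column named above and assuming the state name column is labeled STATNAME:

```python
# Numeric column: returns count, mean, std, min, 25% (Q1), 50% (median),
# 75% (Q3), and max
print(education_districtwise['OVERALL_LI'].describe())

# Categorical column: returns count, unique, top (the mode), and freq
# (how often the top value appears)
print(education_districtwise['STATNAME'].describe())
```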
In this case, you get a count of all the observations in the column, along with the following information: the number of unique values, the most common value, or mode, and the frequency of the most common value. Use the describe function to find out how many states are in your dataset and which state has the most districts. The unique category shows you that there are 36 states in your dataset. The top category shows you that state 21 is the most common value and contains the most districts. The frequency category tells you that state 21 appears in 75 rows, which means it includes 75 different districts. This is the mode. This information may be helpful in determining which states will need more educational resources based on their number of districts. The describe function is so useful because it shows you a variety of key stats all at once. Python also has separate functions for stats, such as mean, median, standard deviation, minimum, and maximum. And you used the mean and median functions earlier in the program to detect outliers. These individual functions are also useful if you want to do further computations based on descriptive stats. For example, you can use the min and max functions together to compute the range of your data. The range will show you the difference between the highest and lowest literacy rates among all districts. To compute the range, use the max and min functions to subtract the lowest literacy rate from the highest literacy rate. First, name a new variable, range_overall_li. Then use the max function on the overall literacy column, input a minus sign, and use the min function on the same column. Finally, display the value of your variable. The range in literacy rates for all districts is about 61.5 percentage points. This is the maximum value of 98.7% minus the minimum value of 37.2%. The large difference tells you that some districts have much higher literacy rates than others. In an upcoming video, you'll continue to analyze this data and you can discover which districts have the lowest literacy rates. This will help the government better understand literacy rates nationally and build on their successful educational programs. Using descriptive stats to summarize your data set is an important early step in the analysis process, giving you a basic understanding of your data.

Hi, I'm Evan. I'm an economist at Google. When I was in high school, I was pretty good at math but wasn't overly interested in it. When I got to college, I took an economics course and was very interested in using that framework to view the world and solve problems. A typical day on the job involves me working with business leaders to understand their problems and then help them brainstorm solutions to their problems. Sometimes this is just talking and consulting on problems and helping them figure out solutions. Or other times I'll go and gather data on my own from the company, and I'll perform analyses and solve problems and help them identify interesting measures or results that could inform the decisions and solve the problems. I also like to devote time each day to doing my own research on topics that I'm unfamiliar with so that I'm always growing my skills and increasing my toolkit. Some of the soft skills that were most important in starting a career in data analytics were, first, being able to present your results.
So you can do tons of work in getting data, mining through the data, trying to find some interesting pieces of information, and you may find something. But then being able to clearly communicate that to other people, who may not be experts in what you're studying, is actually quite difficult. Make sure you master the basics and don't try to go too fast. And if there's something you don't understand, rewatch it, do the reading, make sure you know the fundamentals. It all builds on itself. If I could give myself some advice when I was starting my first data analytics role, it would be to take time to meet other people in the field and to network with them and learn what they know. I think a lot of data professionals in this field have built up lots of knowledge that's quite useful, specific to their company, specific to their roles, specific to certain types of problems, and that knowledge lives inside their head. It's not written in a book, it's not in a manual, but they have this knowledge. And so the more you talk with people, the more you meet people and talk about these different problems that you have, the faster you can grow and the faster you can learn. Instead of having to learn these things on your own and kind of solve the problems bit by bit, you can work with other people and they can just help you to solve these so much quicker because they've already solved the problem and learned that information.

We've come to the end of the first section of your statistics course. Wow, you've learned a lot already. Along the way, we've explored how data professionals use statistics to gain insights from their data. This helps business leaders make decisions and solve complex problems. We began with the two main types of statistics, descriptive and inferential. Data professionals use descriptive stats to explore and summarize their data. They use inferential stats to draw conclusions and make predictions about their data. During your tour of descriptive stats, you learned about measures of central tendency, dispersion, and position. Finally, you learned that Python is a powerful tool for statistical analysis. You used Python to explore a data set and quickly calculate descriptive statistics to summarize your data. You can use these skills to better understand any new data set you may encounter in your future career. Coming up, you have a graded assessment. To prepare, check out the reading that lists all the new terms you've learned and feel free to revisit videos, readings, and other resources that cover key concepts. Congrats on your progress so far and I'll meet you again soon.

Hey there. I really enjoyed exploring descriptive statistics with you and I'm excited for what's next: probability. Probability is the branch of mathematics that deals with measuring and quantifying uncertainty. In other words, probability uses math to describe the likelihood of something happening. For example, the chance of rain tomorrow or of winning the lottery. Data professionals use probability to help business leaders make data-driven decisions in situations of uncertainty. No one can know the outcome of future events with complete certainty. What data professionals can do is use all the available data to make reasonable predictions based on probability. For instance, imagine you're working with a stakeholder at a large aerospace company. They need to decide whether or not to invest in a new technology to improve the production process for their jet engines.
As a data professional, you can estimate the probability that the new technology will have a positive impact and predict what its potential costs and benefits might be. The stakeholder can use this information to make an informed decision about what's best for the organization. We'll start by reviewing the two main types of probability, objective and subjective. We'll cover basic rules of probability like the complement rule, the addition rule and the multiplication rule. Then we'll go over conditional probability and how to describe the relationship between dependent events. We'll check out Bayes theorem, a key formula for conditional probability and the basis for more advanced Bayesian analysis. You'll also learn about probability distributions. Probability distributions describe the likelihood of the possible outcomes of a random event and can be discrete or continuous. We'll check out discrete probability distributions such as the binomial and Poisson and find out how they can help you model specific kinds of data. Then we'll explore continuous probability distributions and focus on the normal distribution, the most widely used distribution in all statistics. You'll discover its main features and how it applies to many different data sets. Next, we'll discuss how Z-scores can help you better understand the relationship between data values in a standard normal distribution. Finally, you'll learn how to use Python's SciPy stats module to apply probability distributions to your data. When you're ready to start learning about probability, join me in the next video. Probability helps you measure and quantify uncertainty and make informed decisions about uncertain outcomes. For example, you might use probability to decide what to wear on a given day. Today's weather forecast says there's a 70% chance of snow. Based on this data, you decide to wear your hat, gloves and snow boots. When the snow falls, you stay warm and dry. Data professionals might use probability to predict the chances that a company will sell a certain amount of product in a given time period, that a financial investment will have a positive return, that a political candidate will win an election, or that a medical test will be accurate. In this video, we'll explore the two main types of probability, objective and subjective. Objective probability is based on statistics, experiments and mathematical measurements. Subjective probability is based on personal feelings, experience or judgment. Let's start with objective probability. Data professionals use objective probability to analyze and interpret data. There are two types of objective probability, classical and empirical. Classical probability is based on formal reasoning about events with equally likely outcomes. To calculate classical probability for an event, you divide the number of desired outcomes by the total number of possible outcomes. For example, if you flip a coin, the result will be either heads or tails. Heads and tails are terms commonly used to refer to the two sides of a coin. There are only two possible outcomes and both outcomes are equally likely. The chance that you get heads is one out of two or 50%. The same goes for tails. Or take playing cards. There are 52 cards in a standard deck. Choosing a card gives you a one in 52 chance, or about 1.9%, of getting any particular card in the deck, whether it's the ace of hearts, 10 of clubs or four of spades. But most events are more complex and do not have equally likely outcomes. Usually the weather isn't a 50% chance of rain or snow.
There might be an 80% chance of rain tomorrow and a 20% chance of some other outcome. While classical probability applies to events with equally likely outcomes, data professionals use empirical probability to describe more complex events. Empirical probability is based on experimental or historical data. It represents the likelihood of an event occurring based on the previous results of an experiment or past events. To calculate empirical probability, you divide the number of times a specific event occurs by the total number of events. For example, say you conduct a taste test with 100 people to find out whether they prefer strawberry or mint chip flavored ice cream. You want to know the probability that a person prefers strawberry ice cream. Your taste test reveals that 80 people prefer strawberry ice cream. To calculate probability, you divide the number of times the event of preferring strawberry ice cream occurs, 80, by the total number of events, 100. 80 divided by 100 equals 0.8 or 80%. So the probability that a person prefers strawberry over mint chip is 80%. Earlier, you learned about inferential statistics and how data professionals use sample data to make inferences or predictions about larger populations. Inferential statistics uses probability too. For instance, a retail company might survey a representative sample of 100 customers to predict the shopping preferences of all their customers. Data professionals rely on empirical probability to help them make accurate predictions based on sample data. For example, in an A/B test of a website, you test a sample of users to make a prediction about the future behavior of all users. Say the users in the sample prefer a green "add to cart" button over a blue one. You may infer from this data that the larger population of future users will probably share their preference. An A/B test lets you make a reasonable prediction about future users based on empirical probability. And this probability can help an online business make smarter decisions and increase sales. In contrast, the results of subjective probability are based on personal feelings, experience or judgment. This type of probability does not involve formal calculations, statistical analysis or scientific experiments. For instance, you may have an overwhelming feeling that a certain horse will win a horse race or that your favorite team will win the championship game. And you may have good reasons for your belief, but your reasons are personal or subjective. Your belief is not based on statistical analysis or scientific experiments. For this reason, the subjective probability of an event may differ widely from person to person. It's important to know the difference between subjective and objective probability when you evaluate a prediction or make a decision. For example, the CEO of an auto company might feel confident that using a new technology to manufacture their pickup truck will cut costs and increase profits. But if their prediction is only based on personal feeling or subjective probability, it may not be reliable. Data science based on statistical analysis or objective probability can help accurately predict the potential impact of the new technology and help the CEO make an informed, data-driven decision about adopting the technology. That's all for now. Coming up, we'll check out some fundamental concepts of probability. Recently, you learned that probability uses math to deal with uncertainty or to determine how likely it is that an event will occur.
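Before diving in, here is the taste test calculation from the previous video as a quick Python sketch; the variable names are chosen just for illustration:

```python
# Empirical probability: times a specific event occurs / total number of events.
prefers_strawberry = 80   # tasters who preferred strawberry ice cream
total_tasters = 100       # total number of people in the taste test

p_strawberry = prefers_strawberry / total_tasters
print(p_strawberry)       # 0.8, or 80%
```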
In this video, you'll learn some fundamental concepts of probability. We'll discuss the mathematical definition of probability and how to calculate probability for single random events. First, I want to give you some context about the types of examples we'll be using. In this part of the course, we're going to continue to reference examples of events like flipping coins, rolling dice, and drawing cards. There are a couple reasons for this. One is historical. The modern theory of probability originates in the analysis of games of chance in the 16th and 17th centuries. Second, and more importantly, these are events with clearly defined outcomes that most people are familiar with. They're just super useful examples of basic probability concepts. That's why they're used in stats classes around the world. Later on in the course, we'll explore probability for more complex events like the ones you'll encounter in your future work as a data professional. So let's talk about the fundamental concepts of probability. First, the probability that an event will occur is expressed as a number between zero and one. If the probability of an event equals zero, there is a 0% chance that the event will occur. If the probability of an event equals one, there is a 100% chance the event will occur. And there are lots of possibilities in between zero and one. If the probability of an event equals 0.5, there is a 50% chance that the event will occur and a 50% chance that it will not. If the probability of an event is close to zero, there's a small chance that the event will occur. If the probability of an event is close to one, there's a strong chance that the event will occur. For example, if the chance of a stock price going up this year is 0.05 or 5%, then you probably don't want to buy it. If it's 0.95 or 95%, then it's probably a good investment. Probability measures the likelihood of random events. The result of a random event cannot be predicted with certainty. Before flipping a coin or rolling a die, you do not know the outcome. The coin could turn up heads or tails and a die could show any number one through six. These are examples of what statisticians call a random experiment, also known as a statistical experiment. A random experiment is a process whose outcome cannot be predicted with certainty. All random experiments have three things in common. The experiment can have more than one possible outcome. You can represent each possible outcome in advance and the outcome of the experiment depends on chance. Let's take the example of flipping a coin. There's more than one possible outcome. You can represent each possible outcome in advance, heads or tails, and the outcome depends on chance. Until you actually toss the coin, you can't know whether it will be heads or tails. Or think about rolling a six-sided die. There's more than one possible outcome and all outcomes can be represented in advance. One, two, three, four, five, and six. The outcome of any roll depends on chance. Until you roll the die, you can't know which number will turn up. To calculate the probability in a random experiment, you divide the number of desired outcomes by the total number of possible outcomes. You may recall that this is also the formula for classical probability. So the probability of tossing a coin and getting heads is one chance in two. This is one divided by two equals 0.5 or 50%. The probability of rolling a die and getting two is one chance out of six. This is one divided by six, which equals 0.1666 repeating, or about 16.7%.
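If it helps to see the formula as code, here is a minimal Python sketch of these single-event calculations; the function name is an illustrative choice:

```python
# Classical probability: desired outcomes divided by total possible outcomes.
def classical_probability(desired_outcomes: int, total_outcomes: int) -> float:
    return desired_outcomes / total_outcomes

# A coin toss: one desired outcome (heads) out of two possible outcomes.
print(classical_probability(1, 2))  # 0.5, or 50%

# A die roll: one desired outcome (a two) out of six possible outcomes.
print(classical_probability(1, 6))  # 0.1666..., or about 16.7%
```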
Now, let's conduct a different random experiment. Imagine a jar contains 10 marbles. Two marbles are red, three are green, and five are blue. You decide to take one marble from the jar. You want to know the probability that the marble will be green. First, count the number of possible outcomes. You have an equal chance of choosing any one of the 10 marbles. Next, figure out how many of these outcomes refer to what you want to know, the chance of choosing a green marble. Of the 10 total marbles, three are green. Therefore, the probability of choosing a green marble is three out of 10 or 0.3. In other words, you have a 30% chance of choosing a green marble. Now you know how to calculate the probability of a single random event. This knowledge will be useful as a building block for more complex calculations of probability. So far, we've been focusing on calculating the probability of single events. Many situations, both in everyday life and in data analytics, involve more than one event. As a future data professional, you'll often deal with probability for multiple events. In this video, we'll cover three basic rules of probability, the complement rule, the addition rule, and the multiplication rule. These rules help you better understand the probability of multiple events. We'll also discuss two different types of events, mutually exclusive events and independent events. Then you'll learn how to calculate probability for each of them. First, let's discuss probability notation, which is the standard way to symbolize probability concepts. As we go along, I'll share some useful notations that will help us communicate more efficiently when it comes to basic probability. The letter P indicates the probability of an event. For example, if you're dealing with two events, you can label one event A and the other event B. The notation for the probability of event A is the letter P followed by the letter A in parentheses. For the probability of event B, it's the letter P followed by the letter B in parentheses. If you want to talk about the probability of event A not occurring, add an apostrophe after the letter A. You can also say this is the probability of not A. Now let's check out our first basic rule, the complement rule. In stats, the complement of an event is the event not occurring. For example, either it rains or it does not rain. Either you win the lottery or you don't win the lottery. The complement of rain is no rain, the complement of winning is not winning. The important thing to know is that the two probabilities, the probability of an event happening and the probability of it not happening, must add to one. Recall that a probability of one is the same as saying there's 100% certainty of an event occurring. Another way to think about it is that there is a 100% chance of one event or the other event happening. There may be a 30% chance of rain tomorrow, but there is a 100% chance that it will either rain or not rain tomorrow. The complement rule says that the probability that event A does not occur is one minus the probability of event A. For example, if the weather forecast says there is a 30% chance of rain tomorrow, there is a probability of 0.3. You can use the complement rule to calculate the probability that it does not rain tomorrow. The probability of no rain equals one minus the probability of rain. This is one minus 0.3 equals 0.7 or 70%. Both the complement rule and our next rule, the addition rule, apply to events that are mutually exclusive. 
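Here is a short Python sketch of the marble experiment and the complement rule; the names and numbers come straight from the examples above:

```python
# Marble jar: 2 red, 3 green, and 5 blue marbles.
marbles = {"red": 2, "green": 3, "blue": 5}
total_marbles = sum(marbles.values())     # 10 possible outcomes

p_green = marbles["green"] / total_marbles
print(p_green)        # 0.3, or a 30% chance of choosing a green marble

# Complement rule: P(not A) = 1 - P(A).
p_rain = 0.3          # a 30% chance of rain tomorrow
p_no_rain = 1 - p_rain
print(p_no_rain)      # 0.7, or a 70% chance of no rain
```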
Two events are mutually exclusive if they cannot occur at the same time. For example, you can't visit both Argentina and China at the same time or turn left and right at the same time. The addition rule says that if the events A and B are mutually exclusive, then the probability of A or B happening is the sum of the probabilities of A and B. Let's check out an example using a six-sided die. Say you want to find out the probability of rolling either a two or a four on a single roll of the die. These two events are mutually exclusive. You can roll a two or a four, but not both at the same time. The addition rule says that to find the probability of either event happening, you should sum up their probabilities. The probability of rolling any single number on a six-sided die is one out of six. So the probability of rolling a two is one over six and the probability of rolling a four is one over six. One sixth plus one sixth equals one third. The probability of rolling either a two or a four is one out of three, or about 33%. The addition rule applies to mutually exclusive events. If you want to calculate probability for independent events, you can use the multiplication rule. Two events are independent if the occurrence of one event does not change the probability of the other event. This means that one event does not affect the outcome of the other event. For example, checking out a book from your local library does not affect tomorrow's weather. Drinking coffee in the morning does not affect the delivery of your mail in the afternoon. These events are separate and independent. The multiplication rule says that if the events A and B are independent, then the probability of both A and B happening is the probability of A multiplied by the probability of B. For instance, imagine two consecutive coin tosses. Say you want to know the probability of tails on the first toss and heads on the second toss. First, figure out what kind of events you're dealing with and then apply the appropriate rule. Two coin tosses are independent events. The first toss does not affect the outcome of the second toss. For any toss, the probability of getting either heads or tails always remains one out of two or 50%. So you would use the multiplication rule for this event. The probability of getting tails and heads is the probability of getting tails multiplied by the probability of getting heads. The probability of each event is 0.5 or 50%. Now, plug in the numbers. 0.5 times 0.5 equals 0.25 or 25%. The probability of getting tails on the first toss and heads on the second toss is 25%. So to recap, let's compare the addition and multiplication rules and list their differences. It'll be helpful to keep these differences in mind so you know when to use the two rules. The addition rule sums up the probabilities of events and the multiplication rule multiplies the probabilities. The addition rule applies to events that are mutually exclusive. The multiplication rule applies to events that are independent. The basic rules of probability help you describe events that are mutually exclusive or independent. In an upcoming video, we'll check out conditional probability which applies to dependent events. So far, you've learned how to calculate probability for a single event and for two or more independent events. Remember, two events are independent if one event does not affect the outcome of the other event, like two coin flips. In this video, you'll learn how to calculate probability for two or more dependent events.
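Before that, here is a minimal Python sketch of the addition and multiplication rules from the previous video, using the same die roll and coin toss examples:

```python
# Addition rule (mutually exclusive events): P(A or B) = P(A) + P(B).
p_two = 1 / 6
p_four = 1 / 6
p_two_or_four = p_two + p_four            # 1/3, or about 33%

# Multiplication rule (independent events): P(A and B) = P(A) * P(B).
p_tails = 0.5
p_heads = 0.5
p_tails_then_heads = p_tails * p_heads    # 0.25, or 25%

print(p_two_or_four, p_tails_then_heads)
```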
This type of probability is known as conditional probability. Conditional probability refers to the probability of an event occurring given that another event has already occurred. Conditional probability is used in many different fields like finance, insurance, science, and machine learning. For example, an agency that sells life insurance might use conditional probability to decide how risky it is to insure someone who skydives for a living. Data professionals, like those who work on machine learning models, use conditional probability to make accurate predictions about complex datasets. Before we get into calculating conditional probability, let's go over the concept of dependence. Two events are dependent if the occurrence of one event changes the probability of the other event. This means that the first event affects the outcome of the second event. For instance, if you want to visit a website, you first need internet access. Visiting a website depends on you having access to the internet. If you want to travel to another country, you first need to get a passport. Traveling to another country depends on you having a passport. In each instance, we can say that the second event is dependent on or conditional on the first event. Let's check out an example of dependence that's closer to probability theory. Imagine you have two events. The first event is drawing an ace from a standard deck of playing cards, and the second event is drawing another ace from the same deck. There are four aces in a deck of 52 cards. So for the first draw, the chance of getting an ace is four out of 52, or about 7.7%. But for the second draw, the probability of getting an ace changes because you've removed a card from the deck. Now there are three aces in a deck of 51 cards. For the second draw, the chance of getting an ace is three out of 51, or about 5.9%. Getting an ace is now less likely. These two events are dependent because getting an ace on the first draw changes the probability of getting an ace on the second draw. Now you have a better understanding of dependent events. Let's return to conditional probability and check out the formula. You don't need to memorize the formula. But personally, I find that reviewing the formula often helps me understand the concept better. That's why I'm sharing it with you. The formula says that for two dependent events, the probability of event A and event B occurring equals the probability of event A times the probability of event B given A. You may notice that we have a new notation in this formula. The vertical bar between the letters B and A means that event B depends on event A happening. We say this as the probability of B given A. The formula can also be expressed as the probability of event B given event A equals the probability that both A and B occur divided by the probability of A. These are just two ways of representing the same equation. Depending on the situation or what information you're given up front, it may be easier to use one or the other. We can apply the conditional probability formula to our example of drawing aces from a deck of playing cards. The probability of event A, or getting an ace on the first draw, is four out of 52. The probability of event B given event A, or of getting an ace on the second draw, is three out of 51. Let's enter these numbers into the formula. The probability of event A and event B, or of getting two aces in a row, is four over 52 multiplied by three over 51. If you do the math, this equals one over 221.
The probability of getting two aces in a row equals one out of 221 or about 0.5%. Let's check out another example. Imagine you're applying for college. The college accepts 10 out of every 100 applicants or 10%. If you're accepted, you also hope to receive an academic scholarship. The college awards academic scholarships to two out of every 100 accepted students or 2%. You want to calculate the probability that you get accepted and you get a scholarship. Getting a scholarship depends on first getting accepted. So this is a conditional probability because it deals with two dependent events. Let's call getting accepted event A and getting a scholarship event B. You want to calculate the probability of event A and event B. According to the formula, to find the probability of event A and event B, you can multiply the probability of event A by the probability of event B given event A. The probability of event A, getting accepted, is 10 out of 100. The probability of event B given event A, or getting a scholarship given that you are first accepted, is two out of 100. 10 divided by 100 times two divided by 100 equals one divided by 500. So the probability of getting accepted and getting a scholarship is one out of 500 or 0.2%. Conditional probability helps you better understand the relationship between dependent events. As a data professional, I often use conditional probability to predict how an event like an ad campaign will impact sales revenue. Then I share my findings with stakeholders so they can make more informed business decisions. Earlier, you learned that conditional probability refers to the probability of an event occurring given that another event has already occurred. For example, when you draw an ace from a deck of playing cards, this changes the probability of drawing a second ace from the same deck. In this video, you'll learn how to calculate conditional probability using Bayes theorem. Bayes theorem, also known as Bayes rule, is a math formula for determining conditional probability. It's named after Thomas Bayes, an 18th century mathematician from London, England. Bayes theorem provides a way to update the probability of an event based on new information about the event. In Bayesian statistics, prior probability refers to the probability of an event before new data is collected. Posterior probability is the updated probability of an event based on new data. Posterior means occurring after. Posterior probability is calculated by updating the prior probability using Bayes theorem. For example, let's say a medical condition is related to age. You can use Bayes theorem to more accurately determine the probability that a person has the condition based on age. The prior probability would be the probability of a person having the condition. The posterior or updated probability would be the probability of a person having the condition if they're in a certain age group. Bayes theorem is the foundation for the field of Bayesian statistics, also known as Bayesian inference, which is a powerful method for analyzing and interpreting data in modern data analytics. Data professionals apply Bayes theorem in a wide variety of fields, from artificial intelligence to medical testing. For instance, financial institutions use Bayesian analysis to rate the risk of lending money to borrowers or to predict the success of an investment. Online retailers use Bayesian algorithms to predict whether or not users will like certain products and services.
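As a quick aside before we get to the theorem, here is a minimal Python sketch of the two conditional probability examples worked above:

```python
# Conditional probability (dependent events): P(A and B) = P(A) * P(B | A).

# Drawing two aces in a row from a standard 52-card deck.
p_first_ace = 4 / 52                     # about 7.7%
p_second_ace_given_first = 3 / 51        # about 5.9%
p_two_aces = p_first_ace * p_second_ace_given_first
print(p_two_aces)                        # 1/221, or about 0.5%

# College example: getting accepted, then a scholarship given acceptance.
p_accepted = 10 / 100
p_scholarship_given_accepted = 2 / 100
p_both = p_accepted * p_scholarship_given_accepted
print(p_both)                            # 1/500, or 0.2%
```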
Marketers rely on Bayes theorem for identifying positive or negative responses in customer feedback. Let's check out the theorem itself. As always, don't worry about memorizing it. Bayes theorem is a bit complicated and this is the basic version of it. Bayes theorem says that for any two events, A and B, the probability of A given B equals the probability of A multiplied by the probability of B given A divided by the probability of B. In math terms, prior probability is the probability of event A. Posterior probability, or what you're ultimately trying to figure out, is the probability of event A given event B. The key for Bayes theorem is that it includes both the conditional probability of B given A and the conditional probability of A given B. If you know one of these probabilities, Bayes theorem can help you determine the other. Let's check out an example. Say you're planning a big outdoor event like a graduation party. The success of the event depends on good weather. On the day of the event, you notice that the morning is cloudy. You want to find out the chance of rain, given that this day starts off cloudy. If there's a high probability of rain, you may decide to move the event indoors or even cancel it. You know the following information. At this time of year, the overall chance of rain is 10%. However, cloudy mornings are common, about 40% of all days start off cloudy, and 50% of all rainy days start off cloudy. In this example, your prior probability is the overall probability of a rainy day. New data will update this probability. In this case, the knowledge that the morning is cloudy and that rain may be coming. What you ultimately want to find out is the probability that it will rain, given that it's cloudy. This is your posterior probability. You can use Bayes theorem to update the prior probability that it rains based on the new data that the morning is cloudy. When you work with Bayes theorem, it's helpful to first figure out what event A is and what event B is. This makes it easier to understand the relationship between events and use the formula. Let's use the word rain to refer to event A, the probability of rain. This is your prior probability. Event B is the probability that the day will be cloudy. Let's use the word cloudy to refer to event B. Now you can rewrite the probability of event B given event A as the probability that it's cloudy given that it rains. Finally, the probability of event A given event B is the probability that it rains given that it's cloudy. This is your posterior probability or the updated probability that Bayes theorem will help you calculate. Finally, enter what you know into the formula. The probability of rain is 10%. The probability that it's cloudy is 40%. The probability that it's cloudy given that it rains is 50%. The probability of rain given that it's cloudy equals 0.1 times 0.5 divided by 0.4. This equals 0.125 or 12.5%. So there is a 12.5% chance of rain today. This is your posterior probability or the updated probability based on the data that the morning is cloudy. The odds are still in your favor so you decide to proceed with your outdoor party. Hope it's a fun one. You've already learned that Bayes theorem tells you how to update the probability of an event based on new data about the event. But there are several different versions of Bayes theorem. They're written in different ways and used for different types of problems.
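Here is the rain example as a minimal Python sketch of the basic version; the function and variable names are illustrative:

```python
# Basic Bayes theorem: P(A | B) = P(A) * P(B | A) / P(B).
def bayes(p_a: float, p_b_given_a: float, p_b: float) -> float:
    return p_a * p_b_given_a / p_b

p_rain = 0.10               # prior probability of rain
p_cloudy = 0.40             # probability that a morning is cloudy
p_cloudy_given_rain = 0.50  # probability of a cloudy morning, given rain

# Posterior probability of rain, given a cloudy morning.
print(bayes(p_rain, p_cloudy_given_rain, p_cloudy))  # 0.125, or 12.5%
```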
In this video, you'll learn about an expanded version of Bayes theorem and how to use it to predict the accuracy of a test. The expanded version of Bayes theorem is long. If you're not an experienced statistician, it may seem quite intimidating. You don't need to worry about memorizing this formula. What's important to know is that the expanded version works better than the basic version in certain situations. The theorem goes like this. The probability of event A given event B equals the probability of B given A multiplied by the probability of A divided by the probability of B given A times the probability of A plus the probability of B given not A multiplied by the probability of not A. Well, that was a lot. You can use the two versions of Bayes theorem to deal with different types of problems. Sometimes, for instance, you don't know the probability of event B, which is in the denominator of the equation for the basic Bayes theorem. In that case, you can use the expanded version of Bayes theorem because you don't need to know the probability of event B to use the expanded version. This longer version of Bayes theorem is often used to evaluate tests such as medical diagnostic tests, quality control tests, or software tests such as spam filters. When evaluating the accuracy of a test, Bayes theorem can take into account the probability for testing errors known as false positives and false negatives. A false positive is a test result that indicates something is present when it really is not. For example, a spam filter may incorrectly identify a legitimate email as spam. False positives are often associated with medical testing, but they also apply to other areas like software testing. For instance, antivirus software may indicate that a computer file is a virus, even though the file is normal. A false negative is a test result that indicates something is not present when it really is. For example, a spam filter may incorrectly identify a spam email as legitimate. False negatives also apply to all kinds of tests. In manufacturing, for instance, a quality control test may incorrectly identify a defective part as an acceptable part. Next, let's explore a detailed example of how to use the longer version of Bayes theorem to evaluate a test. Let's say you want to evaluate the accuracy of a diagnostic test that checks for the presence of a peanut allergy. Suppose that 1% of the population is allergic to peanuts. Based on historical data, if a person has the allergy, there's a 95% chance that the test is positive. If a person doesn't have the allergy, there's still a 2% chance that the test is positive. This is a false positive because it's a positive result for a person who does not actually have the allergy. You want to know, given that a person tests positive, what are the chances that they actually have the allergy? You can also think of the situation in terms of prior and posterior probability, which you learned about earlier in connection with the basic version of Bayes theorem. You start off with the prior probability that a person has the allergy. This is 1%. Then you'll update this prior probability with new data based on testing, the probability of getting true positive and false positive test results. Finally, you want to figure out the posterior probability that the allergy is present given that the test is positive. There are two main events in this situation. First, actually having the allergy. Second, testing positive. Let's call having the allergy event A and testing positive event B.
Remember, these two events are different because you can test positive and not have the allergy, which is a false positive. Now, let's review what you know. First, there is the probability that a person actually has the allergy, which is 1%. So the probability of event A equals 1%. Next, there is a 95% chance that a test is positive if the person has the allergy. This is a conditional probability for two dependent events. The probability of a positive test given that the allergy is present. So the probability of event B given event A equals 95%. Then there's the false positive result, the 2% chance that the test is positive given that the allergy is not present. This is another conditional probability, the probability of event B given not A equals 2%. Finally, if you use the complement rule, you can also figure out one more probability, the probability of not having the allergy. The complement rule says that the probability that event A does not occur is one minus the probability of event A. So if the probability of event A, actually having the allergy, is 1% or 0.01, then the probability of not having the allergy is one minus 0.01. This equals 0.99 or 99%. So the probability of not A equals 99%. These are the probabilities you know. What you don't know is the probability of event B, the probability that a person gets a positive test result. This is where you'd have trouble using the basic version of Bayes theorem because the probability of event B is part of the formula. Instead, you can use the expanded version since you don't need to know the probability of event B for that formula. Now you can enter what you know into the formula. The probability of A is 1% or 0.01. The probability of not A equals 99% or 0.99. The probability of B given A equals 95% or 0.95. And the probability of B given not A equals 2% or 0.02. If you do the math, the result is 0.324 or 32.4%. So the probability of event A given event B, or the probability that the allergy is present given that the test is positive, is 32.4%. If 32.4% seems low to you, it's because the allergy is rare to begin with. It's not very likely that a random person will both test positive and have the allergy. The expanded version of Bayes theorem gives you a better understanding of the accuracy of the test by taking into account multiple probabilities. That's all for now. So far, we've covered a lot of key concepts in basic probability. What you've learned about basic probability will help you better understand probability distributions, our main topic for this part of the course. In my job as a data professional, I use probability distributions to model different kinds of data sets and to identify significant patterns in my data. A probability distribution describes the likelihood of the possible outcomes of a random event. Probability distributions can represent the possible outcomes of simple random events like tossing a coin or rolling a die. They can also represent more complex events like the probability of a new medication successfully treating a medical condition. A random variable represents the values for the possible outcomes of a random event. There are two types of random variables, discrete and continuous. A discrete random variable has a countable number of possible values. Often, discrete variables are whole numbers that can be counted. For example, if you roll a die five times, you can count the number of times the die lands on two. If you toss a coin five times, you can count the number of times it lands on heads.
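Looking back at the allergy test for a moment, here is that expanded calculation as a minimal Python sketch; the variable names are illustrative:

```python
# Expanded Bayes theorem:
# P(A | B) = P(B | A) * P(A) / (P(B | A) * P(A) + P(B | not A) * P(not A))
p_allergy = 0.01                 # prior: 1% of the population has the allergy
p_no_allergy = 1 - p_allergy     # complement rule: 99%
p_pos_given_allergy = 0.95       # true positive rate
p_pos_given_no_allergy = 0.02    # false positive rate

numerator = p_pos_given_allergy * p_allergy
denominator = numerator + p_pos_given_no_allergy * p_no_allergy
print(numerator / denominator)   # about 0.324, or 32.4%
```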
A continuous random variable takes all the possible values in some range of numbers. When it comes to continuous variables, you're dealing with decimal values rather than whole numbers. For instance, all the decimal values between one and two, such as 1.1, 1.12, 1.125, and so on. These values are not countable since there is no limit to the possible number of decimal values between one and two. Typically, these are decimal values that can be measured, such as height, weight, time, or temperature. For example, if you measure the height of a person or object, you can keep on making your measurement more accurate. The height of a person could be 70.2 inches, 70.23 inches, 70.237 inches, 70.2375 inches, and so on. There is no limit to the number of possible values. It's not always immediately obvious if a variable is discrete or continuous. To help choose between the two, you can use the following general guidelines. If you can count the number of outcomes, you are working with a discrete random variable. For example, counting the number of times a coin lands on heads. If you can measure the outcome, you are working with a continuous random variable. For example, measuring the time it takes for a person to run a marathon. Now that we've explored random variables, let's return to the topic of probability distributions, which describe the probability of each possible value of a random variable. Discrete distributions represent discrete random variables, and continuous distributions represent continuous random variables. In statistics, you can use the term sample space to describe the set of all possible values for a random variable. Once you know the sample space of a random variable, you can assign probabilities to each of the possible values. For example, a single coin toss is a random variable with two possible values, heads and tails. So the sample space is heads and tails. If you roll a six-sided die, you have a random variable with six possible values, or a sample space of one, two, three, four, five, and six. Let's check out an example of a discrete probability distribution. Take the familiar random event of a single die roll. The sample space for a single die roll is one, two, three, four, five, and six. The probability of each outcome is the same, one out of six, or 16.7%. You can display a discrete probability distribution as a table or a graph. The distribution table summarizes the probability for each possible outcome. The top row lists each outcome of the die roll, and the bottom row lists the corresponding probability. The bar graph, or histogram, shows the same probability distribution, but in a different form. For a discrete probability distribution, the random variable is plotted along the x-axis, and the corresponding probability is plotted along the y-axis. In this case, the x-axis represents each possible outcome of a single die roll, one through six. The y-axis represents the probability of each outcome. Continuous probability distributions and their graphs work a little differently from discrete distributions. This is due to the difference between discrete and continuous random variables. The probability distribution for a discrete random variable can tell you the exact probability for each possible value of the variable. For instance, the probability of rolling a die and getting a three is one out of six, or about 16.7%. The probability distribution for a continuous random variable can only tell you the probability that the variable takes on a range of values.
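Here is the die roll distribution as a minimal Python sketch, using exact fractions so the probabilities visibly sum to one:

```python
from fractions import Fraction

# Discrete probability distribution for a single six-sided die roll:
# each outcome in the sample space has probability 1/6.
sample_space = [1, 2, 3, 4, 5, 6]
distribution = {outcome: Fraction(1, 6) for outcome in sample_space}

for outcome, probability in distribution.items():
    print(outcome, float(probability))  # each is about 0.167, or 16.7%

# The probabilities of all possible outcomes sum to 1.
print(sum(distribution.values()))       # 1
```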
Let's check out an example to learn more. A continuous random variable may have an infinite number of possible values. Imagine you want to measure the height of an oak tree you picked at random from a nearby forest. In this example, the height of the tree is a continuous random variable. The tree's height could be, say, 15 feet, or 15.2 feet, or 15.2187 feet, and so on. You can keep on adding another decimal place to the measurement without limit. Now, say you want to know the probability that the height of the oak tree is exactly 15.2 feet. Because the height of the tree could be any decimal value between the range of 15 feet and 16 feet, the probability that the tree is exactly any single value is essentially zero. In this example, you'll need to use a continuous probability distribution to tell you the probability that the height of the oak tree is in a certain range or interval, such as between 15 feet and 16 feet. The probability of any specific value is zero, so it only makes sense to talk about the probabilities of intervals. A convenient way to show the probabilities of a range or interval of values is with a curve. On a graph, continuous distributions appear as curves. You may have heard of the bell curve, which refers to the graph for a continuous distribution called the normal distribution. On the curve, the x-axis refers to the value of the variable you're measuring, in this case, oak tree height. The y-axis refers to something called probability density. This is a math function that deals with the values of intervals. You don't need to focus on the math part right now. Just know that probability density is not the same thing as probability. There's a lot more to learn about probability distributions and how they can help you model different kinds of data. These topics are complex, so feel free to revisit the video to go over this part again. Recently, you learned about discrete probability distributions, which represent discrete random events, like tossing a coin or rolling a die. Often, the outcomes of discrete events are expressed as whole numbers that can be counted. For example, the number of times a coin lands on heads in 10 tosses. In this video, you'll learn about one of the most widely used discrete probability distributions, the binomial distribution. The binomial distribution is a discrete distribution that models the probability of events with only two possible outcomes, success or failure. This definition assumes that each event is independent, or does not affect the probability of the others, and that the probability of success is the same for each event. For example, the binomial distribution applies to an event like tossing the same coin 10 times in a row. Keep in mind that success and failure are labels used for convenience. For example, each toss has only two possible outcomes, heads or tails. You could choose to label either heads or tails as a successful outcome based on the needs of your analysis. Whatever label you apply to the outcomes, it's important to know that they must be mutually exclusive. As a quick refresher, two outcomes are mutually exclusive if they cannot occur at the same time. You can't get both heads and tails in a single coin toss. It's either one or the other. Data professionals use the binomial distribution to model data in different fields, such as medicine, banking, investing, and machine learning.
For example, data professionals use the binomial distribution to model the probability that a new medication generates side effects, a credit card transaction is fraudulent, or a stock price rises or falls in value. In machine learning, the binomial distribution is often used to classify data. For example, a data professional may train an algorithm to recognize whether a digital image of an animal is or is not a cat. The binomial distribution represents a type of random event called a binomial experiment. A binomial experiment is a type of random experiment. You may recall that a random experiment is a process whose outcome cannot be predicted with certainty. All random experiments have three things in common. The experiment can have more than one possible outcome. You can represent each possible outcome in advance, and the outcome of the experiment depends on chance. On the other hand, a binomial experiment has the following attributes. The experiment consists of a number of repeated trials. Each trial has only two possible outcomes. The probability of success is the same for each trial, and each trial is independent. An example of a binomial experiment is tossing a coin 10 times in a row. This is a binomial experiment because it has the following features. The experiment consists of 10 repeated trials or coin tosses. Each trial has only two possible outcomes, heads or tails. The probability of success is the same for each trial. If you define success as heads, then the probability of success for each toss is the same, 50%. Each trial is independent. The outcome of one coin toss does not affect the outcome of any other coin toss. Let's check out another example of a binomial experiment. Suppose you want to know how many customers return an item to a department store on a given day. Say 100 customers visit the store each day. 10% of all customers who visit the store make a return. You label a return as a success. This is a binomial experiment because there are 100 repeated trials or customer visits. Each trial only has two possible outcomes, return or not return. If you label return as success, the probability of success for each customer visit is the same, 10%. Each trial is independent. The outcome of one customer visit does not affect the outcome of any other customer visit. It's important to understand the features of a binomial experiment because a binomial distribution can only model data for this type of event. If you're working with data for a different type of event, you need to use a different type of probability distribution, like the Poisson, to model the data. Once you've determined that your distribution is binomial, you can apply the binomial distribution formula to calculate the probability. No need to memorize it. You can use your computer to make the calculations. If you want to learn more, feel free to check out the relevant reading. In brief, the binomial distribution formula helps you determine the probability of getting a certain number of successful outcomes in a certain number of trials. For example, getting a certain number of heads in a certain number of coin flips. In this formula, K refers to the number of successes and N refers to the number of trials. P refers to the probability of success on a given trial, and N choose K refers to the number of ways to obtain K successes in N trials. Let's explore our department store example to better understand how the formula works. This time, suppose 10% of all customers who visit the store make a return.
Imagine that three customers visit the store. You label a return as a success. You can use the formula to determine the probability of getting zero, one, two, and three returns among the three customers. In the calculation, X refers to the number of returns. I'll skip the calculations and go directly to the results. If you plug in for the probability that X equals zero returns, the result is 0.729. For the probability that X equals one return, the result is 0.243. For the probability that X equals two returns, the result is 0.027. For the probability that X equals three returns, the result is 0.001. You can then use a histogram to visualize this probability distribution. For a discrete probability distribution, like the binomial distribution, the random variable is plotted along the X axis and the corresponding probability is plotted along the Y axis. In this case, the X axis shows the number of returns, zero, one, two, and three. The Y axis shows the probability of getting each result. The binomial distribution lets you model the probability of events with only two possible outcomes, success or failure. Identifying the distribution of your data is a key step in any analysis and helps you make informed predictions about future outcomes. As a data professional, knowing about probability distributions is super useful because different types of distributions help you model different kinds of data. Every time I work with a new data set, I try to understand if there's a pattern present in the distribution of the data. Knowing the probability distribution of my data also helps me choose the machine learning model that works best. This way, I'm able to get a better result in less time. Data professionals work with many different types of probability distributions. As you advance in your career and continue to learn, you can explore different distributions and discover how they apply to your work. In this part of the course, we're focusing on two of the most common discrete probability distributions, the binomial and the Poisson. Earlier, you learned that the binomial distribution represents experiments with repeated trials that each have two possible outcomes, success or failure. In this video, you'll learn about the main features of the Poisson distribution. The Poisson distribution is a probability distribution that models the probability that a certain number of events will occur during a specific time period. The Poisson distribution can also be used to represent the number of events that occur in a specific space, such as a distance, area, or volume. But in this course, we'll focus on time. Baron Siméon-Denis Poisson, a French mathematician, originally derived the Poisson distribution in 1830. He developed the distribution to describe the number of times a gambler would win a difficult game of chance in a large number of tries. Data professionals use the Poisson distribution to model data, such as the expected number of calls per hour for a customer service call center, visitors per hour for a website, customers per day at a restaurant, and severe storms per month in a city. The Poisson distribution represents a type of random experiment called a Poisson experiment. A Poisson experiment has the following attributes. The number of events in the experiment can be counted. The mean number of events that occur during a specific time period is known, and each event is independent. Let's explore an example. Imagine you're a data professional working for a large restaurant chain that serves fast food.
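Before getting into the Poisson details, here is a quick Python sketch that reproduces the binomial results above, using the SciPy stats module mentioned earlier in the course:

```python
from scipy import stats

# Binomial experiment: 3 customer visits, each with a 10% chance of a return.
n = 3      # number of trials
p = 0.1    # probability of success (a return) on each trial

for k in range(n + 1):
    # pmf gives the probability of exactly k successes in n trials.
    print(k, round(stats.binom.pmf(k, n, p), 3))
# 0 0.729
# 1 0.243
# 2 0.027
# 3 0.001
```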
You know that the drive-through service at a restaurant receives an average of two orders per minute. You want to determine the probability that a restaurant will receive a certain number of orders in a given minute. This is a Poisson experiment because the number of events in the experiment can be counted. You can count the number of orders. The mean number of events that occur during a specific time period is known. There's an average of two orders per minute. Each outcome is independent. The probability of one person placing an order does not affect the probability of another person placing an order. Once you know that you're working with the Poisson distribution, you can apply the Poisson distribution formula to calculate the probability. In brief, the formula helps you determine the probability of a certain number of events occurring during a specific time period. In this formula, the Greek letter lambda refers to the mean number of events that occur during a specific time period. K refers to the number of events. E is a constant equal to approximately 2.71828. The exclamation point stands for factorial, a function that multiplies a number by every whole number below it down to one. For example, two factorial is two multiplied by one. Let's continue with our restaurant chain example to better understand how the formula works. Recall that the drive-through service at a restaurant receives an average of two orders per minute. You can use the Poisson formula to determine the probability of the restaurant receiving zero, one, two, or three orders in a given minute. Knowing this information may help the restaurant organize staffing for the drive-through. I'll skip the calculations and go directly to the results. If you plug in for the probability that x equals zero orders, the result is 0.1353. For the probability that x equals one order, the result is 0.2707. For the probability that x equals two orders, again, the result is 0.2707. For the probability that x equals three orders, the result is 0.1804. You can then use a histogram to visualize the probability distribution. The x-axis shows the number of events. In this case, orders per minute. The y-axis shows the probability of occurrence. For example, the probability of getting zero orders in a minute is about 0.1353 or 13.53%. The probability of one order is 0.2707 or 27.07%. The probability of two orders is also 0.2707 or 27.07%. The probability of three orders is 0.1804 or 18.04%. Before we finish up, let's compare the two discrete probability distributions you recently learned about, the binomial and the Poisson. Sometimes it can be challenging to figure out if you should use a binomial distribution or a Poisson distribution. To help you choose between the two, you can use the following general guidelines. Use the Poisson distribution if you're given the average number of events for a specific time period, and you want to find out the probability of a certain number of events happening in that time period. For example, if a call center averages 10 customer service calls per hour, you can use the Poisson distribution to find the probability of getting 12 calls between 2 p.m. and 3 p.m. Use the binomial distribution if you are given the exact probability of an event happening, and you want to find out the probability of the event happening a certain number of times in a repeated trial.
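As with the binomial, a short Python sketch using the SciPy stats module can reproduce the drive-through results from this video:

```python
from scipy import stats

# Poisson experiment: the drive-through averages 2 orders per minute.
lam = 2    # lambda, the mean number of events per time period

for k in range(4):
    # pmf gives the probability of exactly k orders in a given minute.
    print(k, round(stats.poisson.pmf(k, lam), 4))
# 0 0.1353
# 1 0.2707
# 2 0.2707
# 3 0.1804
```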
For example, if the probability of getting heads for any coin toss is 50%, you can use the binomial distribution to find the probability of getting eight heads in 10 coin tosses. That's all for discrete probability distributions. In your future career as a data professional, you'll use discrete distributions like the binomial and the Poisson to better understand your data and make informed predictions about future outcomes. So far, we've been talking about discrete probability distributions where the outcomes of experiments are represented by countable whole numbers. For example, rolling a die can result in a two or a three but not a decimal value such as 2.178 or 3.394. Now, we'll move from discrete to continuous probability distributions. Recall that continuous probability deals with outcomes that can take on all the values in a range of numbers. Typically, these are decimal values that can be measured such as height, weight, time or temperature. For example, you can keep on measuring time with more accuracy, 1.1 seconds, 1.12 seconds, 1.1257 seconds and so on. In this video, we'll explore the most widely used probability distribution in statistics, the normal distribution. The normal distribution is a continuous probability distribution that is symmetrical on both sides of the mean and bell shaped. The normal distribution is often called the bell curve because its graph has the shape of a bell with a peak at the center and two downward sloping sides. It is also known as the Gaussian distribution after the German mathematician Carl Friedrich Gauss, who first described the formula for the distribution. While we're on the subject of formulas, if you want to learn more about this formula, please check out the relevant reading where it's discussed in more detail. The normal distribution is the most common probability distribution in statistics because so many different kinds of data sets display a bell shaped curve. For example, if you randomly sample 100 people, you will typically see an approximately normal distribution for continuous variables such as height, weight, blood pressure, IQ scores, salaries and more. For example, think of the typical results of standardized tests. The majority of people will score close to the average score or mean. Fewer people will score below or above the average, farther out from the mean. A very small percentage of people will score extremely high or extremely low, very far away from the mean. This distribution of scores generates a bell curve. Most of the data values are relatively close to the mean. The further a value is away from the mean, the less likely it is to occur. On a normal curve, the x-axis refers to the value of the variable you're measuring, and the y-axis refers to how likely you are to observe the value. In the case of test scores, the x-axis is the raw score, and the y-axis is the percentage of the population that gets that score. Data professionals use the normal distribution to model all kinds of different data sets in the fields of business, science, government, machine learning and others. Understanding the normal distribution is also important for more advanced statistical methods such as hypothesis testing and regression analysis, which you'll learn more about later. Plus, many machine learning algorithms assume that data is normally distributed. All normal distributions have the following features. The shape is a bell curve. The mean is located at the center of the curve.
The curve is symmetrical on both sides of the center, and the total area under the curve equals one. To clarify the features of the normal distribution, let's graph the weights of Honeycrisp apples. Assume that the weights of Honeycrisp apples are approximately normally distributed with a mean of 100 grams and a standard deviation of 15 grams. First, find the mean at the center of the curve. This is also the highest point or peak of the curve. This data point represents the most probable outcome in the data set, the mean weight of 100 grams. Second, notice that the curve is symmetrical on each side of the mean. 50% of the data is above the mean, and 50% is below the mean. Third, the farther a point is away from the mean, the lower the probability of those outcomes. The points farthest from the mean represent the least probable outcomes in the data set. These are apples that have more extreme weights, either low or high. Finally, the area under the curve is equal to one. This means that the area under the curve accounts for 100% of the possible outcomes in the distribution. On a normal distribution, the distance of a data point from the mean is often measured in standard deviations. As a refresher, the standard deviation calculates the typical distance of a data point from the mean of your data set. While the mean refers to the center of your data, the standard deviation measures spread. As the standard deviation becomes larger, data values become more spread out from the mean. In our apple example, the mean weight is 100 grams, and the standard deviation is 15 grams. So an apple that is one standard deviation above the mean will weigh 115 grams, or the mean weight of 100 grams plus the standard deviation of 15 grams. An apple that is one standard deviation below the mean will weigh 85 grams, or 100 grams minus 15 grams. An apple that's two standard deviations above the mean will weigh 130 grams, and an apple that's two standard deviations below the mean will weigh 70 grams. The values on a normal curve are distributed in a regular pattern based on their distance from the mean. This is known as the empirical rule. It states that for a given data set with a normal distribution, 68% of values fall within one standard deviation of the mean, 95% of values fall within two standard deviations of the mean, and 99.7% of values fall within three standard deviations of the mean. The empirical rule can give you a clear idea of how the values in your data set are distributed, which helps you save time and better understand your data. Let's continue with our apple example. The empirical rule tells you that most apples, or 68%, will fall within one standard deviation of the mean weight of 100 grams. This means that 68% of apples will weigh between 85 grams, which is one standard deviation below the mean, and 115 grams, one standard deviation above the mean. 95% of apples will weigh between 70 grams and 130 grams, or within two standard deviations from the mean. Almost all apples, or 99.7%, will weigh between 55 grams and 145 grams, or within three standard deviations of the mean. The empirical rule is useful for estimating data, especially for large data sets, like height and weight data for an entire population. You can use the empirical rule to get an initial estimate of the distribution of values in your data set, such as what percentage of values will fall within one, two, or three standard deviations of the mean. This saves time and helps you better understand your data.
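Here is the apple example as a minimal Python sketch of the empirical rule; the numbers match the example above:

```python
mean = 100   # mean apple weight in grams
sd = 15      # standard deviation in grams

# Empirical rule: about 68%, 95%, and 99.7% of values fall within one,
# two, and three standard deviations of the mean, respectively.
for k, pct in [(1, 68), (2, 95), (3, 99.7)]:
    low, high = mean - k * sd, mean + k * sd
    print(f"About {pct}% of apples weigh between {low} and {high} grams")
# About 68% of apples weigh between 85 and 115 grams
# About 95% of apples weigh between 70 and 130 grams
# About 99.7% of apples weigh between 55 and 145 grams
```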
Plus, knowing the location of your values on a normal distribution is useful for detecting outliers. Recall that an outlier is a value that differs significantly from the rest of the data. Typically, data professionals consider values that lie more than three standard deviations below or above the mean to be outliers. It's important to identify outliers because some extreme values may be due to errors in data collection or data processing. These false values may skew the results of your analysis. Let's explore another example of how the empirical rule can help you better understand your data. Imagine you have a garden. The height of your plants is normally distributed with a mean of 32.1 inches and a standard deviation of 2.2 inches. Let's say you want to find out what percentage of plants are greater than 29.9 inches tall. You want your plants to be at least this tall as part of your landscape design plan for your backyard. First, find out where the value 29.9 is located on the distribution. 29.9 is located one standard deviation below the mean. The empirical rule tells you that 68% of values fall within one standard deviation of the mean. Half of these values, or 34%, fall below the mean. Now you know that 34% of values are between 29.9 and the mean, because 29.9 is one standard deviation below the mean. Plus, 50% of all values in a normal distribution fall above the mean, or center of the curve. The sum of these two percentages will tell you the overall percentage of values greater than 29.9. 34% plus 50% equals 84%. So 84% of your plants are greater than 29.9 inches tall. The empirical rule helps you quickly understand the overall distribution of your data values. Now you know that most of your plants are tall enough for your landscape design plan. As a future data professional, you'll use the normal distribution to identify significant patterns in a wide variety of data sets. Recently, you learned about the normal distribution and how it applies to many different kinds of data sets. In this video, you'll learn about z-scores and how they can help you compare values from different types of normally distributed data sets. A z-score is a measure of how many standard deviations below or above the population mean a data point is. A z-score gives you an idea of how far from the mean a data point is. For example, the z-score is zero if the value is equal to the mean. The z-score is positive if the value is greater than the mean. The z-score is negative if the value is less than the mean. Z-scores help you standardize your data. In statistics, standardization is the process of putting different variables on the same scale. There is a formula for this, which we'll check out a little later. Z-scores are also called standard scores because they're based on what's called the standard normal distribution. A standard normal distribution is just a normal distribution with a mean of zero and a standard deviation of one. Z-scores typically range from negative three to positive three. Standardization is useful because it lets you compare scores from different data sets that may have different units, mean values, and standard deviations. Data professionals use z-scores to better understand the relationship between data values within a single data set and between different data sets. For example, data professionals often use z-scores for anomaly detection, which finds outliers in data sets.
Applications of anomaly detection include finding fraud in financial transactions, flaws in manufacturing products, intrusions in computer networks, and more. For example, different customer satisfaction surveys may have different rating scales. One survey could rate a product or service from one to 20, another from 500 to 1500, and a third from 130 to 180. Let's say the same product got a score of nine on the first survey, 850 on the second, and 142 on the third. These numbers don't mean much by themselves, but if you know they all have a z-score of one, or one standard deviation above the mean, you can meaningfully compare ratings across surveys. A z-score for an individual value can be interpreted as follows. A z-score of one is one standard deviation above the mean. A z-score of 1.5 is 1.5 standard deviations above the mean. A z-score of negative 2.3 is 2.3 standard deviations below the mean. You can use the following formula to calculate a z-score: z equals x minus mu, divided by sigma. In this formula, x refers to a single data value, or raw score. The Greek letter mu refers to the population mean, and the Greek letter sigma refers to the population standard deviation. So we can also say that z equals the raw score, or data value, minus the mean, divided by the standard deviation. For example, let's say you take a standardized test. You have a test score of 133. The test has a mean score of 100 and a standard deviation of 15. Assuming a normal distribution, you can use the formula to calculate your z-score. Your z-score is your raw score, 133, minus the mean score, 100, divided by the standard deviation, 15. This is 133 minus 100 divided by 15, which equals 33 divided by 15, which equals 2.2. Your z-score of 2.2 tells you that your test score is 2.2 standard deviations above the mean, or average, score. That's a really good score. Recall that the empirical rule tells you that 95% of values fall within two standard deviations of the mean. Your score of 2.2 is more than two standard deviations above the mean. Z-scores are useful because they give us an idea of how an individual value compares to the rest of the distribution. Let's take a different exam with a different grading scale. Say you score an 85. You want to find out if that's a good score relative to the rest of the class. Whether or not it's a good score depends on the mean and standard deviation of all exam scores. Suppose the exam scores are normally distributed with a mean score of 90 and a standard deviation of four. You can use the formula to calculate the z-score of a raw score of 85. Your z-score is your raw score, 85, minus the mean score, 90, divided by the standard deviation, four. This is 85 minus 90 divided by four, which equals negative five divided by four, which equals negative 1.25. Your z-score of negative 1.25 tells you that your exam score of 85 is 1.25 standard deviations below the mean, or average, exam score. Z-scores give you an idea of how individual values compare to the mean. As a data professional, you'll use z-scores to help you better understand the relationship between specific values in your data set. You'll most likely use a programming language like Python to calculate z-scores on your computer, as you will learn in an upcoming video.
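As a preview, the formula translates directly into code. Here's a minimal sketch that reproduces both worked examples; the function name is chosen just for illustration:

    def z_score(x, mu, sigma):
        # How many standard deviations the raw score x lies from the mean mu
        return (x - mu) / sigma

    print(z_score(133, 100, 15))  # 2.2: 2.2 standard deviations above the mean
    print(z_score(85, 90, 4))     # -1.25: 1.25 standard deviations below the mean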
When I deal with a new data set, I first go through the process of EDA and compute descriptive stats to get a basic understanding of my data. After that, I try to determine if my data fits a certain type of probability distribution, like the binomial, Poisson, and normal distributions you recently learned about. Knowing the distribution of my data helps me decide what statistical test or machine learning model will work best for my analysis. Python has a great selection of function libraries for data analysis. Using Python to work with distributions saves time and improves the overall efficiency of my analysis. In this video, you'll use Python to model your data with the normal distribution. Then you'll compute z-scores to find any outliers in your data. We'll continue with our previous scenario, in which you're a data professional working for the Department of Education of a large nation. Recall that you're analyzing data on the literacy rate for each district, and you've already computed descriptive statistics to summarize your data. You'll continue to use the data set you worked with before. If you need to access the data, do so now. Along with pandas, NumPy, and matplotlib.pyplot, you'll use two Python packages that may be new to you: scipy.stats and statsmodels. SciPy is an open-source software library you can use for solving mathematical, scientific, engineering, and technical problems. It allows you to manipulate and visualize data with a wide range of Python commands. scipy.stats is a module designed specifically for statistics. statsmodels is a Python package that lets you explore data, work with statistical models, and perform statistical tests. It includes an extensive list of stats functions for different types of data. Now that you know more about the packages you'll be working with, let's open up a Jupyter notebook and load them up. To start, import the Python packages you will use, NumPy, pandas, and statsmodels.api, and the library you will use, matplotlib.pyplot. To save time, rename each package and library with an abbreviation: np, pd, sm, and plt. To load the scipy.stats module, write: from scipy import stats. For the next part of your analysis, you want to find out if the data on district literacy rates fits a specific type of probability distribution. The first step in trying to model your data with a probability distribution is to plot a histogram. This will help you visualize the shape of your data and determine if it resembles the shape of a specific distribution. Use matplotlib's histogram function to plot a histogram of the district literacy rate data. Recall that the OVERALL_LI column contains this data. The x-axis of your plot refers to the literacy rate of each district, and the y-axis refers to the count, or the number of districts. The histogram shows that the distribution of your literacy rate data is bell-shaped and symmetric about the mean. Recall that the normal distribution is a continuous probability distribution that is bell-shaped and symmetrical on both sides of the mean. The mean literacy rate, which is around 73%, is located in the center of the plot. The shape of your histogram suggests that the normal distribution might be a good modeling option for your data.
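Here's roughly what that setup and plot look like as code. The OVERALL_LI column name comes from the course data set; the DataFrame and file names below are placeholders, so adjust them to match your copy of the data:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    from scipy import stats

    # Placeholder file name; point this at your copy of the data set
    education_districtwise = pd.read_csv("education_districtwise.csv")

    # Histogram of district literacy rates, to inspect the shape of the data
    plt.hist(education_districtwise["OVERALL_LI"])
    plt.xlabel("District literacy rate (%)")
    plt.ylabel("Number of districts")
    plt.show()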
To verify that your data is normally distributed, you can use Python to find out if your data follows the empirical rule. Recall that the empirical rule says that for every normal distribution, about 68% of values fall within one standard deviation of the mean, 95% fall within two standard deviations of the mean, and 99.7% fall within three standard deviations of the mean. Since the normal distribution seems like a good fit for the district literacy rate data, you can expect the empirical rule to apply relatively well. In other words, you can expect that about 68% of literacy rates will fall within one standard deviation of the mean, 95% will fall within two standard deviations, and 99.7% will fall within three standard deviations. First, name two new variables to store the values for the mean and standard deviation of the district literacy rate. Name your first variable mean underscore overall underscore li and compute the mean. Display the value of your variable. The mean district literacy rate is about 73.4%. Name your second variable std underscore overall underscore li, compute the standard deviation, and display the value of your variable. The standard deviation is about 10%. If your data follows the empirical rule, you can expect roughly 68% of your values to fall within one standard deviation of the mean district literacy rate of 73%. One standard deviation below the mean is 63%, or 73 minus one times 10. One standard deviation above the mean is 83%, or 73 plus one times 10. So you can expect roughly 68% of your values to fall within this range of 63% to 83%. Now compute the actual percentage of district literacy rates that fall within one standard deviation of the mean. To do this, first name two new variables, lower underscore limit and upper underscore limit. The lower limit will be one standard deviation below the mean, or the mean minus one times the standard deviation. The upper limit will be one standard deviation above the mean, or the mean plus one times the standard deviation. To write the code for the calculations, use your two previous variables, mean underscore overall underscore li and std underscore overall underscore li, for the mean and standard deviation. Next, add a line of code that tells the computer to decide if each value in the OVERALL_LI column is between the lower limit and upper limit. In other words, decide if each value is greater than or equal to one standard deviation below the mean and less than or equal to one standard deviation above the mean. To do this, use the relational operators greater than or equal to and less than or equal to, and the bitwise operator and. Finally, use the mean function to divide the number of values that are within one standard deviation of the mean by the total number of values, and run the code. The output shows you that about 0.664, or 66.4%, of your district literacy rates fall within one standard deviation of the mean. This is very close to the roughly 68% that the empirical rule suggests. You can use the same code structure to determine how many of your literacy rate values fall within two and three standard deviations of the mean. Just multiply the standard deviation by two or three instead of one. About 0.954, or 95.4%, of your values fall within two standard deviations of the mean, and about 0.996, or 99.6%, of your values fall within three standard deviations of the mean. Your values of 66.4%, 95.4%, and 99.6% are very close to what the empirical rule suggests: roughly 68%, 95%, and 99.7%. At this point, it's safe to say your data follows a normal distribution. Knowing that your data is normally distributed is useful for analysis because many statistical tests and machine learning models assume a normal distribution. Plus, when your data follows a normal distribution, you can use z-scores to measure the relative position of your values and find outliers in your data.
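Collected in one place, the empirical-rule check just described looks something like this, assuming the same DataFrame as in the earlier sketch:

    mean_overall_li = education_districtwise["OVERALL_LI"].mean()  # about 73.4
    std_overall_li = education_districtwise["OVERALL_LI"].std()    # about 10

    # Share of districts within one standard deviation of the mean
    lower_limit = mean_overall_li - 1 * std_overall_li
    upper_limit = mean_overall_li + 1 * std_overall_li
    within_one = ((education_districtwise["OVERALL_LI"] >= lower_limit)
                  & (education_districtwise["OVERALL_LI"] <= upper_limit)).mean()
    print(within_one)  # about 0.664; swap the 1 for 2 or 3 for the other checks

Let's explore how to calculate z-scores in Python now.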
Recall that a z-score is a measure of how many standard deviations below or above the population mean a data point is. A z-score is useful because it tells you where a value lies in a distribution. For example, if I tell you a literacy rate is 80%, this doesn't give you much information about where the value lies in the distribution. However, if I tell you the literacy rate has a z-score of two, then you know that the value is two standard deviations above the mean. Data professionals often use z-scores for outlier detection. Typically, they consider observations with a z-score smaller than negative three or larger than positive three as outliers. These are values that lie more than three standard deviations below or above the mean. To find outliers in your data, first create a new column called Z underscore score that includes the z-scores for each district literacy rate in your data set. Then compute the z-scores with the function stats.zscore. Python takes care of all the calculations. Now write some code to identify outliers, or districts with z-scores that are greater than positive three or less than negative three. Use the relational operators greater than and less than, and the bitwise operator or. Using z-scores, you identify two outlying districts, district 461 and district 429. The literacy rates in these two districts are more than three standard deviations below the overall mean, which means they have unusually low rates. Your analysis gives you important information to share. The government may want to provide more funding and resources to these two districts in the hopes of significantly improving literacy. Probability distributions are useful for modeling your data and help you determine which statistical tests to use for an analysis. In addition to the normal distribution, Python can help you work with a wide range of probability distributions.
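Here's a compact sketch of those steps, again assuming the DataFrame from the earlier sketches:

    # Compute a z-score for every district literacy rate
    education_districtwise["Z_SCORE"] = stats.zscore(
        education_districtwise["OVERALL_LI"])

    # Keep only districts more than three standard deviations from the mean
    outliers = education_districtwise[
        (education_districtwise["Z_SCORE"] > 3)
        | (education_districtwise["Z_SCORE"] < -3)]
    print(outliers)  # in the course data set, districts 461 and 429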
You've come to the end of your introduction to probability. Wow, you've learned a lot of important concepts. Great work. Along the way, we've explored how data professionals use probability to make reasonable predictions about uncertain events and help people and organizations make data-driven decisions. Basic probability is a foundational part of data science, and it also informs more advanced statistical methods, such as hypothesis testing and regression analysis, which you'll explore later in the program. In your future career as a data professional, you'll use probability distributions to discover significant patterns in your data. Plus, a working knowledge of probability distributions is key for machine learning, an essential tool in modern data science. We started off this part of the course by reviewing the two main types of probability, objective and subjective. Data professionals use objective probability to analyze and interpret data. From there, we reviewed the basic rules of probability, such as the complement, addition, and multiplication rules. Then, you learned about conditional probability, which helps you better understand the relationship between dependent events. We also discussed Bayes' theorem, which updates the probability of an event based on new data about the event. After that, we moved from basic probability to probability distributions. Probability distributions describe the likelihood of the possible outcomes of a random event and can be discrete or continuous. Data professionals use probability distributions to find meaningful patterns in complex data sets. Next, we explored discrete probability distributions, such as the binomial and Poisson, and discovered how they can help you model different types of data. Then, we moved on to continuous probability distributions. We focused on the normal distribution, or bell curve, the most widely used distribution in statistics. We also discussed how z-scores can help you better understand the relationship between values in a standard normal distribution. Finally, you learned that the scipy.stats module is a powerful tool for working with probability distributions. You used the normal distribution to model your data and gain useful insights. Coming up, you have a graded assessment. To prepare, check out the reading that lists all the new terms you've learned. And feel free to revisit videos, readings, and other resources that cover key concepts. Until we meet again, good luck. Hello again. It's great to be back to continue our learning journey together. You've learned a lot so far. You now have a better understanding of the essential role statistics plays in data science. You also have a solid foundation for using descriptive statistics and basic probability to describe, analyze, and interpret data. Your knowledge of the fundamental concepts of statistics is the first step on a path that leads to more advanced methods like hypothesis testing and regression analysis. What's really exciting is that this learning journey will continue throughout your future career as a data professional. The amount of data in the world is always growing, and the data career space is constantly advancing. I often read about new machine learning methods to keep up with the changes in the field and develop new skills to use at work. The next stage of your journey is all about sampling, or the process of selecting a subset of data from a population. For example, if you want to survey a population of 100,000 people, you can select a representative sample of 100 people. Then you can draw conclusions about the population based on your sample data. Coming up, we'll go over how the sampling process works and how data professionals use sample data to better understand larger populations. Recall that in statistics, a population can refer to any type of data, including people, objects, events, measurements, and more. We'll start with a review of inferential statistics and examine the concept of a representative sample. Next, we'll go over the different stages of the sampling process, from choosing a target population to collecting data for your sample. Then we'll explore the two main types of sampling methods, probability sampling and non-probability sampling. We'll discuss the benefits and drawbacks of various sampling methods and describe how random sampling can help ensure that your sample is representative of the population. We'll also introduce different forms of bias in sampling, like undercoverage and non-response bias, and how they can affect non-probability sampling methods. After that, we'll explore sampling distributions, which are probability distributions for sample statistics. You'll learn about sampling distributions for both sample means and proportions, and how to estimate the corresponding values for populations. We'll also cover the central limit theorem and how it can help you estimate the population mean for different types of data sets. Finally, you'll learn how to use Python's scipy.stats module to work with sampling distributions and make a point estimate of a population mean.
When you're ready, I'll meet you in the next video. Earlier in the course, we briefly discussed the difference between descriptive and inferential statistics. Descriptive statistics, like the mean and standard deviation, summarize the main features of a data set. Inferential statistics use sample data to draw conclusions or make predictions about a larger population. Now, we're going to return to inferential statistics and explore the relationship between sample and population in more detail. This part of the course is all about sampling, the process of drawing a subset of data from a population. In this video, we'll discuss how data professionals use sampling in data science and the importance of working with a sample that is representative of the population. Data professionals use sampling to analyze many different types of data. Here are some questions that sampling has helped my data science team answer. How many products in an app store do we need to test to feel confident that all the products are free of malware? How do we select a sample of users to run an effective A/B test for an online retail store? And how do we select a sample of customers of a video streaming service to get reliable feedback on the shows they watch? Sampling is useful in data science because selecting a sample requires less time than collecting data on every item in a population. Using a sample saves money and resources, and analyzing a sample is more practical than analyzing an entire population. This is especially important in modern data analytics, where you often deal with extremely large data sets. For example, let's say you want to know the percentage of people in a large city who use a laptop computer. One way to do this is to survey every resident in the city. First, it would be very difficult to access contact information for every resident of a city. Second, giving a survey to every resident of the city would be very expensive, complicated, and time consuming. Another way is to find a much smaller subset of residents and give them a survey. This subset is your sample. Then you can use the sample data you collect about laptop use to draw conclusions about the laptop use of the entire population. Collecting a sample is faster, more practical, and less expensive than collecting data on every member of the population. Keep in mind that your sample should be representative of your population. Recall that a representative sample accurately reflects the characteristics of a population. The inferences and predictions you make about your population are based on your sample data. If your sample doesn't accurately reflect your population, then your inferences will not be reliable and your predictions will not be accurate. And this can lead to negative outcomes for stakeholders and organizations. For instance, let's say you only contact computer scientists for your laptop survey. Your sample will not accurately reflect the overall population. Computer scientists are much more likely to use a laptop computer than the typical city resident. Many residents may not have access to any kind of computer or even know how to use one. A sample that only includes computer scientists is not representative. A representative sample would include people with different levels of computer knowledge and access. Let's consider another scenario. Imagine you want to find out the average height of every adult in the United States. That's a lot of people.
It would take an incredible amount of time, energy, and money to even attempt to measure every person in the country. Instead, you can take a sample of 100 people and use that sample data to draw conclusions about the entire population. Now, let's say you have sample data only from professional basketball players. Pro basketball players are really tall. Some are over seven feet tall. On average, they're much taller than almost everybody else in the population. Their average height does not accurately reflect the average height of the overall population. A sample that includes only pro basketball players is not representative of every adult in the US. As a data professional, I work with sample data every day. I can tell you that having a representative sample is super important. A wise teammate of mine once said that a good model can't overcome a bad sample. And they're right. Data professionals work with powerful statistical tools that can model complex data sets and help generate valuable insights. But if the sample data you're working with does not accurately reflect your population, that is, if your sample is not representative, then it ultimately doesn't matter how good your model is. If your predictive model is based on a bad sample, then your predictions will not be accurate. Ultimately, the quality of your sample helps determine the quality of the insights you share with stakeholders. To make reliable inferences about all your customers based on feedback from a sample of customers, make sure your sample is representative of the population. As a data professional, you'll work with sample data all the time. Often, this will be sample data previously collected by other researchers. Sometimes, your team may collect their own data. Either way, it's important to know how the sampling process works, because it directly affects the quality of your sample data. The sampling process helps determine whether your sample is representative of the population and whether your sample is unbiased. If you estimate the mean height of a country's total adult population based on a sample of professional basketball players, your estimate will not be accurate. In this video, we'll go over the main stages of the typical sampling process. This will give you a useful framework for understanding how sampling is conducted and how the sampling process can affect your sample data. To get a clear overview of the sampling process, let's divide it into five steps. Step one, identify the target population. Step two, select the sampling frame. Step three, choose the sampling method. Step four, determine the sample size. And step five, collect the sample data. As an example, let's consider a public opinion poll. Imagine the city government of Vancouver, Canada wants to build a new subway system. The public will vote on whether or not to move forward with the project. The city government wants to find out if there is public support for the project. They ask you to take a poll and estimate the percentage of adult residents that support the project. Legal adults are 18 years or older. The first stage in the sampling process is to identify your target population. The target population is the complete set of elements that you're interested in knowing more about. In this case, the target population includes every resident in the city who is 18 years or older and eligible to vote. Let's say that the city contains 100,000 such residents.
Since it's too difficult and too expensive to survey everyone in the target population, you decide to take a sample. The next step in the sampling process is to create a sampling frame. A sampling frame is a list of all the items in your target population. Basically, it's a complete list of everyone or everything you're interested in researching. The difference between a target population and a sampling frame is that the population is general and the frame is specific. So if your target population is 100,000 city residents who are 18 years or older and eligible to vote, your sampling frame could be a list of names for all these residents, from Alana Aoki to Zoe Zappa. For practical reasons, your sampling frame may not accurately match your target population, because you may not have access to every member of the population. For instance, the city may not have reliable contact information about each resident, or perhaps not all eligible voters are actually registered to vote, so their opinions about the potential subway system aren't relevant, since the project will be decided by an election. For reasons like these, your sampling frame will not exactly overlap with your target population. Your sampling frame will include the list of residents 18 or over that you're able to obtain useful information about. So the sampling frame is the accessible part of your target population. Next, you need to choose a sampling method, which is step three of the sampling process. One way to help ensure that your sample is representative is to choose the right sampling method. There are two main types of sampling methods, probability sampling and non-probability sampling. In later videos, we'll explore the specifics in more detail. For now, just know that probability sampling uses random selection to generate a sample. Non-probability sampling is often based on convenience or the personal preferences of the researcher rather than random selection. Because probability sampling methods are based on random selection, every person in the population has an equal chance of being included in the sample. This gives you the best chance to get a representative sample, as your results are more likely to accurately reflect the overall population. So, assuming you have the budget and the time, you can use a probability sampling method for your poll about the subway project. Using random selection gives you the best chance of getting a sample that's representative of your population. Step four of the sampling process is to determine the best size for your sample, since you don't have the resources to poll everyone in your sampling frame. In statistics, sample size refers to the number of individuals or items chosen for a study or experiment. Sample size helps determine the accuracy of the predictions you make about the population. In general, the larger the sample size, the more accurate your predictions. Based on the desired level of accuracy for your survey, you can decide how many eligible voters to include in your sample. Now, you're ready to collect your sample data. This is the final step in the sampling process. To poll the residents selected for your sample, you decide to conduct a survey. Based on the survey responses, you determine the percentage of eligible voters 18 and over who favor the proposed subway project. Then you share this information with city leaders to help them make a more informed decision. Effective sampling ensures that your sample data is representative of your target population.
Then, when you use sample data to make inferences about the population, you can be reasonably confident that your inferences are reliable. Your poll will give city leaders a better idea of public support for the new subway and help inform future decisions about the project. The decisions you make at each step of the sampling process can affect the quality of your sample data. Understanding the sampling process will make you a better data professional, whether you're analyzing data collected by other researchers or conducting a survey on your own. In a previous video, you learned the differences between probability and non-probability sampling. Then you conducted a survey using probability sampling, which is the third step of the sampling process. In this video, you will learn more about the different methods of probability sampling. Then we'll discuss the benefits and drawbacks of each method. There are four different probability sampling methods: simple random sampling, stratified random sampling, cluster random sampling, and systematic random sampling. In a simple random sample, every member of a population is selected randomly and has an equal chance of being chosen. You can randomly select members using a random number generator or by another method of random selection. For example, say you want to survey the employees of a company about their work experience. The company employs 1,000 people. You can assign each employee in the company database a number from one to 1,000, and then use a random number generator to select 100 people for your sample. The main benefit of simple random samples is that they're usually fairly representative, since every member of the population has an equal chance of being chosen. Random samples tend to avoid bias, and surveys like these give you more accurate results. However, in practice, it's often expensive and time-consuming to conduct large simple random samples. And if your sample size is not large enough, a specific group of people in the population may be underrepresented in your sample. If you use a larger sample size, your sample will more accurately reflect the population. The next method of probability sampling is a stratified random sample. In a stratified random sample, you divide a population into groups and randomly select some members from each group to be in the sample. These groups are called strata. Strata can be organized by age, gender, income, or whatever category you're interested in studying. For example, say you want to survey high school students about how much time they spend studying on weekends. You might divide the student population according to age: 14, 15, 16, and 17-year-olds. Then you can survey an equal number of students from each age group. Stratified random samples help ensure that members from each group in the population are included in the survey. This method allows you to draw more accurate conclusions about the relevant groups. For instance, 14-year-olds and 17-year-olds may have different perspectives about studying on the weekends. Older students can drive, and may have more social activities or work on the weekends. Stratified sampling will capture both perspectives. One main disadvantage of stratified sampling is that it can be difficult to identify appropriate strata for a study if you lack knowledge of a population. For example, if you want to study median income among a population, you may want to stratify your sample by job type, or industry, or location, or education level.
If you don't know how relevant these categories are to median income, it will be difficult to choose the best one for your study. The next method of probability sampling is a cluster random sample. When you're conducting a cluster random sample, you divide a population into clusters, randomly select certain clusters, and include all members from the chosen clusters in the sample. Cluster sampling is similar to stratified random sampling, but in stratified sampling, you randomly choose some members from each group to be in the sample. In cluster sampling, you choose all members from a group to be in the sample. Clusters are divided using identifying details, such as age, gender, location, or whatever you want to study. For example, imagine you want to conduct a survey of employees at a global company using cluster sampling. The company has 10 offices in different cities around the world. Each office has about the same number of employees in similar job roles. You randomly select three offices in three different cities as clusters. You include all the employees at the three offices in your sample. One advantage of this method is that a cluster sample gets every member from a particular cluster, which is useful when each cluster reflects the population as a whole. This method is helpful when dealing with large and diverse populations that have clearly defined subgroups. If researchers want to learn about the preferences of primary school students in Oslo, Norway, they can use one school as a representative sample of all schools in the city. A main disadvantage of cluster sampling is that it may be difficult to create clusters that accurately reflect the overall population. For example, for practical reasons, you may only have access to the offices in the United States when your company has locations all over the world. And employees in the United States may have different characteristics and values than employees in other countries. The final method of probability sampling is a systematic random sample. In systematic random samples, you put every member of a population into an ordered sequence. Then you choose a random starting point in the sequence and select members for your sample at regular intervals. Let's assume you want to survey students at a community college. For a systematic random sample, you put the students' names in alphabetical order, randomly choose a starting point, and pick every fifth name to be in the sample. Systematic random samples are often representative of the population, since every member has an equal chance of being included in the sample. Whether a student's last name starts with B or R isn't going to affect their characteristics. Systematic sampling is also quick and convenient when you have a complete list of the members of your population. One disadvantage of systematic sampling is that you need to know the size of the population that you want to study before you begin. If you don't have this information, it's difficult to choose consistent intervals. The four methods of probability sampling we've covered, simple, stratified, cluster, and systematic, are all based on random selection, which is the preferred method of sampling for most data professionals. These methods can help you create a sample that is representative of the population. In an upcoming video, we'll check out some methods of non-probability sampling and why they're not considered representative. In my work as a data professional, I often use sample data to help build machine learning models.
Today, a machine learning model may help determine if a person gets an approval for a loan, an interview for a job, or an accurate medical diagnosis. Models based on representative samples are much more likely to make fair and unbiased decisions about who gets a loan or a job interview. Using samples that are representative of the different types of people in the population helps ensure that each person receives the treatment that is best for them. Unfortunately, bias can affect sample data. Sampling bias occurs when a sample is not representative of the population as a whole. To eliminate bias, I try to use samples that are representative of the overall population. The consequences of drawing conclusions from a non-representative sample can be serious. Recently, you learned that probability sampling methods use random selection, which helps avoid sampling bias. A randomly chosen sample means that all members of the population have an equal chance of being included. In contrast, non-probability sampling methods do not use random selection, so they do not typically generate representative samples. In fact, they often result in biased samples. However, non-probability sampling is often less expensive and more convenient for researchers to conduct. Sometimes, due to budget, time, or other reasons, it's just not possible to use probability sampling. Plus, non-probability methods can be useful for exploratory studies, which seek to develop an initial understanding of a population, not draw conclusions or make predictions about the population as a whole. In this video, we'll discuss four methods of non-probability sampling and learn how sampling bias can affect each method. These four methods are convenience sampling, voluntary response sampling, snowball sampling, and purposive sampling. Let's start with convenience sampling. In this method, you choose members of a population who are easy to contact or reach. As the name suggests, conducting a convenience sample involves collecting a sample from somewhere convenient to you, such as your workplace, a local school, or a public park. For example, to conduct an opinion poll, a researcher might stand in front of a local high school during the day and poll people who happen to walk by. Because these samples are based on convenience to the researcher and not a broader sample of the population, convenience samples often show undercoverage bias. Undercoverage bias occurs when some members of a population are inadequately represented in the sample. For instance, people who don't work at or attend the school will not be represented as much in the sample. The next method of non-probability sampling is voluntary response sampling. This type of sample consists of members of a population who volunteer to participate in a study. For example, the owners of a restaurant want to know how people feel about their dinner options. They ask their regular customers to take an online survey about the quality of the restaurant's food. Voluntary response samples tend to suffer from non-response bias, which occurs when certain groups of people are less likely to provide responses. People who voluntarily respond will likely have stronger opinions, either positive or negative, than the rest of the population. This makes the volunteer customers at the restaurant an unrepresentative sample. The next non-probability sampling method is snowball sampling.
In a snowball sample, researchers recruit initial participants to be in a study and then ask them to recruit other people to participate in the study. Like a snowball, the sample size gets bigger and bigger as more participants join in. For example, if a study was investigating cheating among college students, potential participants might not want to come forward. But if a researcher can find a couple of students willing to participate, these two students may know others who have also cheated on exams. The initial participants could then recruit others by sharing the benefits of the study and reassuring them of confidentiality. Although it may seem convenient that study participants help build the sample, this type of recruiting can lead to sampling bias. Because initial participants recruit additional participants on their own, it's likely that most of them will share similar characteristics. And these characteristics might be unrepresentative of the total population under study. In purposive sampling, researchers select participants based on the purpose of their study. Because participants are selected for the sample according to the needs of the study, applicants who do not fit the profile are rejected. For example, a researcher wants to survey students on the effectiveness of certain teaching methods at their university. The researcher only wants to include students who regularly attend class and have an established record of academic achievement. So they select the students with the highest grade point averages to participate in the study. In purposive sampling, the researcher often intentionally excludes certain groups from the sample to focus on a specific group they think is most relevant to their study. In this case, the researcher excludes students who don't have high grade point averages. This could lead to biased outcomes because the students in the sample are not likely to be representative of the overall student population. As a data professional, you have to think about bias and fairness from the moment you start collecting sample data to the time you present your conclusions. Once you become aware of some common forms of bias, you can remain on the alert for bias in any form. In previous videos, you learned how the sampling process works and the benefits and drawbacks of various sampling methods. As a data professional, I often work with sample data to make informed predictions about future sales revenue or product performance. Understanding how sampling affects your data, both positively and negatively, will be important in your future career in data analytics. For instance, one way data professionals use sample statistics is to estimate population parameters. As you may recall, a statistic is a characteristic of a sample, and a parameter is a characteristic of a population. For example, the mean weight of a random sample of 100 penguins is a statistic. The mean weight of the total population of 10,000 penguins is a parameter. A data professional might use the mean weight of the sample of 100 penguins to estimate the mean weight of the population. This type of estimate is called a point estimate. A point estimate uses a single value to estimate a population parameter. In this video, we'll discuss the concept of sampling distribution and how it can help you represent the possible outcomes of a random sample. You'll also learn how the sampling distribution of the sample mean can help you make a point estimate of the population mean.
A sampling distribution is a probability distribution of a sample statistic. Recall that a probability distribution represents the possible outcomes of a random variable, such as a coin toss or a die roll. In the same way, a sampling distribution represents the possible outcomes for a sample statistic, like the mean. Imagine you take repeated simple random samples of the same size from a population. Since each sample is random, the mean value will vary from sample to sample in a way that cannot be predicted with certainty. Right now, this may seem a bit abstract. To get a better idea of the sampling distribution of the mean, let's continue with our penguin example. Imagine you're studying a population of 10,000 blue penguins, which are the smallest of all known penguin species. You want to find out the mean weight of a blue penguin in this population. Since it would take too long to locate and weigh every single penguin, you instead collect sample data from the population. Let's say you take repeated simple random samples of 10 penguins each from the population. In other words, you randomly choose 10 penguins from the group, weigh them, and then repeat this process with a different set of 10 penguins. For your first sample, you find the mean weight of the 10 penguins is 3.1 pounds. For your second sample, the mean weight of the 10 penguins is 2.9 pounds. For your third sample, the mean weight is 2.8 pounds, and so on. Imagine that the true mean weight of a penguin in this population is three pounds, although in practice, you wouldn't know this unless you weighed every single penguin. Each time you take a sample of 10 penguins, it's likely that the mean weight of the penguins in your sample will be close to the population mean of three pounds, but not exactly three pounds. Every once in a while, you may get a sample full of smaller than average penguins with a mean weight of 2.5 pounds or less, or you might get a sample full of larger than average penguins with a mean weight of 3.5 pounds or more. The mean weight will vary randomly from sample to sample. Sampling variability refers to how much an estimate varies between samples. You can use a sampling distribution to represent the frequency of all your different sample means. I find that it helps to visualize these samples as a histogram. Let's plot 10 simple random samples of 10 penguins each. The most frequently occurring sample means will be around three pounds. The least frequent sample means will be the more extreme weights, such as 2.3 or 3.7 pounds. As you increase the size of a sample, the mean weight of your sample data will get closer to the mean weight of the population. If you sampled the entire population, in other words, if you actually weighed all 10,000 penguins, your sample mean would be the same as your population mean. But to get an accurate estimate of the population mean, you don't have to weigh 10,000 penguins. If you take a large enough sample size from a population, say 100 penguins, your sample mean will be an accurate estimate of the population mean. This point is based on the central limit theorem, which we'll explore in more detail later on in the course. For now, just know that if your sample is large enough, your sample mean will roughly equal the population mean. For instance, imagine you collect a sample of 100 penguins and find that the mean weight of your sample is three pounds. This means that your best estimate for the mean weight of the entire penguin population is also three pounds.
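If you'd like to see sampling variability for yourself, you can simulate it. This sketch invents a penguin population with a true mean of three pounds; the half-pound standard deviation and the other numbers are assumptions chosen purely for illustration:

    import numpy as np

    rng = np.random.default_rng(seed=42)

    # Invented population: 10,000 penguin weights, true mean of 3 pounds
    population = rng.normal(loc=3, scale=0.5, size=10_000)

    # Take 1,000 simple random samples of 10 penguins; record each sample mean
    sample_means = [rng.choice(population, size=10, replace=False).mean()
                    for _ in range(1_000)]

    print(np.mean(sample_means))  # clusters near the population mean of 3

Plotting sample_means as a histogram shows the bell-shaped sampling distribution described above.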
You can also use your sample data to estimate how accurately the mean weight of any given sample represents the population mean weight. This is useful to know because the mean varies from sample to sample, and any given sample is not necessarily an exact reflection of the population mean. For example, the true mean weight for the penguin population might be three pounds, while the mean weight for any given sample of penguins might be 3.3 pounds, 2.8 pounds, 2.4 pounds, and so on. The more variability in your sample data, the less likely it is that the sample mean is an accurate estimate of the population mean. Data professionals use the standard deviation of the sample means to measure this variability. Recall that the standard deviation measures the variability of your data, or how spread out your data values are. The more spread between the data values, the larger the standard deviation. In statistics, the standard deviation of a sample statistic is called the standard error. The standard error of the mean measures variability among all your sample means. A larger standard error indicates that the sample means are more spread out, or that there's more variability. A smaller standard error indicates that the sample means are closer together, or that there's less variability. The smaller the standard error, the more likely it is that your sample mean is an accurate estimate of the population mean. For example, say you take three random samples of 10 penguins each. The mean weight of the first sample is 3.3 pounds, the second is 3.1 pounds, and the third is 2.9 pounds. There's not much variability among these three sample means. The values are all close together. The standard error will be relatively small. Now, say you take another three random samples of 10 penguins each. The mean weight of the first sample is 2.2 pounds, the second is 3.2 pounds, and the third is 4.2 pounds. There's more variability among these three sample means. The values are more spread out. The standard error will be relatively large. Note that the concept of standard error is based on the practice of repeated sampling. In reality, researchers usually work with a single sample. It's often too complicated, expensive, or time consuming to take repeated samples of a population. Instead, statisticians have derived a formula for calculating the standard error based on the mathematical assumption of repeated sampling. You can use the following formula to calculate the standard error of the sample mean: s divided by the square root of n, where s is the sample standard deviation and n is the sample size. For example, in your study of penguin weights, imagine that a sample of 100 penguins has a mean weight of three pounds and a standard deviation of one pound. You can calculate the standard error by dividing the sample standard deviation, one, by the square root of the sample size, 100. One divided by the square root of 100 equals 0.1. This means that your best estimate for the true population mean weight of all penguins is three pounds, but you should expect that the mean weight from one sample to the next will vary with a standard deviation of about 0.1 pounds. As your sample size gets larger, your standard error gets smaller. This is because standard error measures the difference between your sample mean and the actual population mean. As your sample gets larger, your sample mean gets closer to the actual population mean. The more accurate the estimate of the population mean, the smaller the standard error.
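As a quick sketch, that standard error arithmetic in code:

    import numpy as np

    s = 1    # sample standard deviation, in pounds
    n = 100  # sample size

    standard_error = s / np.sqrt(n)
    print(standard_error)  # 0.1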
Say you collect a sample of 10,000 penguins instead of 100 penguins. You find that the sample mean weight is three pounds and the sample standard deviation is one pound. The standard error is one divided by the square root of 10,000, which equals 0.01. Your best estimate for the sample mean will still be three pounds, but now you can expect that the mean weight from one sample of penguins to the next will vary with a standard deviation of just 0.01 pounds. In general, you can have more confidence in your estimates as the sample size gets larger and the standard error gets smaller. This is because the mean of your sampling distribution gets closer to the population mean. Coming up, we'll explore this idea further when we talk about the central limit theorem. Recently, we talked briefly about the central limit theorem and the relationship between sample size and the sample mean. Data professionals use the central limit theorem to estimate population parameters for data in economics, science, business, and other fields. For example, they may use the theorem to estimate the mean annual household income for an entire city or country, the mean height and weight for an entire animal or plant population, or the mean commute time for all the employees of a large corporation. In this video, you'll learn more about the central limit theorem and how it can help you estimate the population mean for different types of data. The central limit theorem states that the sampling distribution of the mean approaches a normal distribution as the sample size increases. In other words, as your sample size increases, your sampling distribution assumes the shape of a bell curve. And if you take a large enough sample of the population, the sample mean will be roughly equal to the population mean. For example, say you want to estimate the average height of a university student in South Africa. Instead of measuring millions of students, you can get data on a representative sample of students. If your sample size is large enough, the mean height of your sample will be roughly equal to the mean height of the population. There is no exact rule for how large a sample size needs to be in order for the central limit theorem to apply. In general, a sample size of 30 or more is considered sufficient. Exploratory data analysis can help you determine how large of a sample is necessary for a given data set. What's really powerful about the central limit theorem is that it holds true for any population. You don't need to know the shape of your population distribution in advance to apply the theorem. If you collect a large enough sample, the shape of your sampling distribution will follow a normal distribution. This pattern holds true even if your population has a skewed distribution. For example, here's a graph of annual household income in the US for the year 2010. The x-axis represents annual income, and the y-axis represents the percent of households that have that income. You may notice how the data is skewed to the right, and the shape of the distribution is far from normal. The distribution of annual income is skewed because of the extraordinarily high incomes of the wealthiest households. However, if you sample incomes at random among all households and take a large enough sample, your sampling distribution will follow a normal distribution. This is true even though the population distribution, income across every US household, is not normal.
And the mean income of your sampling distribution will give you an accurate estimate of the mean income of the entire population. Let's check out another example. Imagine you're studying the population of coffee drinkers in the United States. You want to know the average amount of coffee each person drinks per day, but you don't have the time or money to survey every single coffee drinker in the US, which, by the way, is around 150 million people. Instead of surveying the entire population, you collect repeated random samples of 100 coffee drinkers. Using this data, you calculate the mean amount of coffee consumed per day for your first sample, 22.5 ounces. For your second sample, the mean amount is 28.2 ounces. You take a third sample, and the mean amount is 25.4 ounces, and so on. In theory, you could take 10, 50, or 100 samples and keep increasing the sample size until you've surveyed all 150 million people about their coffee consumption. The central limit theorem says that as your sample size increases, the shape of your sampling distribution will increasingly resemble a bell curve. In practice, the specific sample size you choose will depend on factors like budget, time, resources, and the desired level of confidence for your estimate. If you take a large enough sample from the population, the mean of your sampling distribution will equal the population mean. From this sample of the population, you can accurately estimate the average amount of coffee consumed per day for the entire population. In case you're wondering, the average American drinks around 24 ounces, or three cups, of coffee per day. Based on what I've noticed, if we took a sample of only data professionals, the mean value might be even higher. Whether you're measuring coffee consumption or household income, the central limit theorem is a useful method for better understanding the distribution of your data. In this part of the course, we've been talking about how data professionals use sample statistics to estimate population parameters. Recently, you learned how to use the sampling distribution of the mean to estimate the actual population mean. For example, you might estimate the mean weight of an animal population or the mean salary of all the people who work in the hospitality industry. Data professionals also use sampling distributions to estimate population proportions. In statistics, a population proportion refers to the percentage of individuals or elements in a population that share a certain characteristic. Proportions measure percentages, or parts of a whole. For example, you might survey 100 employees at a large company to estimate what percentage of all employees like the food at the office cafeteria. Data professionals might also use the sampling distribution of the proportion to estimate the proportion of all visitors to a website who make a purchase before leaving, the proportion of assembly line products that meet quality control standards, or the proportion of voters who support a candidate in an upcoming election. In this video, you'll learn about the sampling distribution of the sample proportion and how it can help you estimate the population proportion. Imagine you work for a market research firm. Your client is a company that manufactures sneakers and wants to make sure their sneakers appeal to the largest audience. You're asked to research sneaker preferences among residents of Santiago, Chile, who are between 16 and 19 years old. There are 100,000 teenagers in that age group.
You might want to find out what proportion of this population prefers slip-on sneakers over sneakers with shoelaces. Since it would take too long to locate and survey all 100,000 teens, you instead collect sample data from the population. Let's say you take repeated simple random samples of 100 teenagers from the overall population. In your first sample, you find that 12% of teenagers prefer slip-on sneakers. In your second sample, you find that 8% prefer slip-on sneakers. You take a third sample, and the proportion is 11%. Earlier, we talked about sampling variability for the sample mean, or how the value of the mean varies from one sample to the next. The same holds true for proportions. Let's assume we know that 10% of teenagers in the total population prefer slip-on sneakers. In most of the samples, the proportion of teenagers who prefer slip-on sneakers will be close to the true population proportion of 10%, but not exactly 10%. Occasionally, a sample may turn out to have a proportion that's very small or very large. You can use a sampling distribution to represent the frequency of all your different sample proportions. For instance, if you take 10 simple random samples of 100 teenagers from this population, you can show the sampling distribution of the proportion in a histogram. The most frequently occurring values in your sample data will be around 10%. The values that occur least frequently will be the more extreme proportions, such as 5% or 15%. As with the sample mean, the central limit theorem also applies to your sample proportions. As your sample size increases, the distribution of the sample proportion will be approximately normal. The overall average, or mean proportion, is located in the center of the curve. If you take a sufficiently large sample of teenagers, the sample proportion will be an accurate estimate of the true population proportion. If you survey 1,000 teenagers and find that 10% prefer slip-on sneakers, this means that your best estimate for the proportion of all teenagers who prefer slip-ons is also 10%. As with the sample mean, you can use the standard error of the proportion to measure sampling variability. This tells you how much a particular sample proportion is likely to differ from the true population proportion. This is useful to know because the proportion varies from sample to sample, and any given sample proportion probably won't be exactly equal to the true population proportion. The true proportion of teenagers who prefer slip-on sneakers might be 10%, but the proportion of any given sample might be 12%, 9%, 7%, and so on. The more variability in your sample data, the less likely it is that the sample proportion is an accurate estimate of the population proportion. It's important to understand the accuracy of your estimate because stakeholder decisions are often based on the estimates you provide. You can use the following formula to calculate the standard error of the proportion: the square root of p hat multiplied by one minus p hat, divided by n. P hat refers to the sample proportion, and n refers to the sample size. In statistics, you say hat when you refer to the caret symbol above the letter p. For example, suppose you survey 100 teenagers about their sneaker preferences and find that your estimate for the population proportion of teens who prefer slip-on sneakers is 10%, or 0.1. In this case, p hat is 0.1 and n is 100.
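As a quick check of the arithmetic for this example, here's a short sketch that applies the formula directly; the variable names are just illustrative:

```python
import numpy as np

p_hat = 0.1  # sample proportion: teens who prefer slip-on sneakers
n = 100      # sample size

standard_error = np.sqrt(p_hat * (1 - p_hat) / n)
print(standard_error)  # 0.03
```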
When you plug the numbers into the formula for the standard error of the proportion, it equals 0.03. As your sample size gets larger, your standard error gets smaller, because standard error measures the difference between your sample proportion and the true population proportion. As your sample gets larger, your sample proportion gets closer to the true population proportion. The more accurate the estimate of the population proportion, the smaller the standard error. Your estimate will help stakeholders at the sneaker company make decisions about product development. Based on your results, they may want to put less money into developing slip-on sneakers. Typically, the next step for a data professional would be to use the standard error to construct a confidence interval. This describes the uncertainty of your estimate and gives your stakeholders more detailed information about your results. Later on in this course, you'll learn how to calculate and interpret confidence intervals to more accurately predict the preferences of a population. Earlier, we talked about how data professionals use sample data to make point estimates about population parameters. For instance, if you want to know the average age of registered voters in Japan, you could take a survey of 100 registered voters. Then you could use the average age of the survey respondents as a point estimate of the average age of all registered voters. If your sample size is large enough, your sample mean will give you a pretty good estimate of the population mean. In this video, you'll use Python to simulate random sampling. Then, based on your sample data, you'll make a point estimate of a population mean. We'll continue with our previous scenario in which you're a data professional working for the Department of Education of a large nation. Recall that you're analyzing data on the literacy rate for each district. You'll continue to use the dataset you worked with before. If you need to access the data, do so now. For this video, we'll make a slight change to our story. Imagine that you are asked to collect the data on district literacy rates and that you have limited time to do so. You can only survey 50 randomly chosen districts instead of all 634 districts included in your original dataset. The goal of your research study is to estimate the mean literacy rate for all 634 districts based on your sample of 50 districts. You can use Python to simulate taking a random sample of 50 districts from your dataset. Now, let's open up a Jupyter notebook and get to work. To start, import the Python packages you will use: NumPy, Pandas, and statsmodels.api, along with the plotting library matplotlib.pyplot. To save time, rename each package and library with an abbreviation: np, pd, sm, and plt. To load the SciPy stats module, write from scipy import stats. First, you'll want to get a random sample of 50 districts. A cool feature of Python is that you can use code to simulate random sampling and choose the desired sample size. To do this, use the sample function in Pandas. Before you write the code, let's review the function and its arguments. To simulate random sampling, use the following arguments in the sample function: n, replace, and random underscore state. n refers to the desired sample size, and replace indicates whether you are sampling with or without replacement. Random underscore state refers to the seed of the random number generator. Let's explore each argument in more detail. First, sample size, or the number of items in your sample.
In this case, you want to take a random sample of 50 district literacy rates from the overall literacy column. Second, replacement. In general, you can sample with or without replacement. When a population element can be selected more than one time, you are sampling with replacement. When a population element can be selected only one time, you are sampling without replacement. For example, suppose you have a jar that contains 100 unique numbers from one to 100. You want to select a random sample of numbers from the jar. After you pick a number from the jar, you can put the number aside or you can put it back in the jar. If you put the number back in the jar, it may be selected more than once. This is sampling with replacement. If you put the number aside, it can be selected only one time. This is sampling without replacement. For the purpose of our example, you will sample with replacement. The final part of the code is random underscore state, or the seed of the random number generator. A random seed is a starting point for generating random numbers. You can use any arbitrary number to fix the random seed and give the random number generator a starting point. Also, going forward, you can use the same random seed to generate the same set of numbers. In a later video, you'll work with this sample again. Now you're ready to write your code. First, name a new variable sampled underscore data. Then set the arguments for the sample function: n, the sample size, equals 50, and replace equals true because you're sampling with replacement. For random state, choose an arbitrary number for your random seed. How about 31,208? Now, display the value of your variable. The output shows 50 districts selected randomly from your data set. Each has a different literacy rate. Now that you have your random sample, use the mean function to compute the sample mean. First, name a new variable estimate one. Next, use the mean function to compute the mean for your sample data. The sample mean for district literacy rate is about 74.22%. This is a point estimate of the population mean based on your random sample of 50 districts. Remember that the population mean is the literacy rate for all districts. Due to sampling variability, the sample mean is usually not exactly the same as the population mean. Next, let's find out what will happen if you compute the sample mean based on another random sample of 50 districts. To generate another random sample, name a new variable estimate two. Then set the arguments for the sample function. Once again, n is 50 and replace is true. This time, choose a different number for your random seed to generate a different sample. How about 56,810? Finally, add the mean function at the end of your line of code to compute the sample mean. Display the value of your variable. For your second estimate, the sample mean for district literacy rate is about 74.24%. Due to sampling variability, this sample mean is different from the sample mean of your previous estimate, 74.22%, but they're really close. Recall that the central limit theorem tells you that when the sample size is large enough, the distribution of the sample mean approaches a normal distribution. And as you sample more observations from a population, the sample mean gets closer to the population mean. The larger your sample size, the more accurate your estimate of the population mean is likely to be. In this case, the population mean is the overall literacy rate for all districts in the nation.
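Putting the whole walkthrough together, here's a sketch of the code described above. The file name education_districtwise.csv and the column name OVERALL_LI are assumed names standing in for the course dataset's actual file and literacy-rate column.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt   # imported as described in the video, used for later plotting
import statsmodels.api as sm
from scipy import stats

# Hypothetical file and column names for the district literacy dataset.
education_df = pd.read_csv('education_districtwise.csv')

# First random sample of 50 districts, sampled with replacement.
sampled_data = education_df['OVERALL_LI'].sample(n=50, replace=True, random_state=31208)
estimate1 = sampled_data.mean()
print(estimate1)  # about 74.22

# A second sample with a different seed gives a slightly different mean.
estimate2 = education_df['OVERALL_LI'].sample(n=50, replace=True, random_state=56810).mean()
print(estimate2)  # about 74.24
```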
In a previous video, you found that the population mean literacy rate is 73.39%. Based on sampling, your first estimated sample mean was 74.22%, and your second estimate was 74.24%. Each estimate is relatively close to the population mean. Now, imagine you repeated the study 10,000 times and obtained 10,000 point estimates of the mean. In other words, you take 10,000 random samples of 50 districts and compute the mean for each sample. According to the central limit theorem, the mean of your sampling distribution will be roughly equal to the population mean. You can use Python to compute the mean of the sampling distribution with 10,000 samples. Let's go over the code step by step. First, create an empty list to store the sample mean from each sample. Name this estimate underscore list. Second, set up a for loop with the range function. The range function generates a sequence of 10,000 numbers, so the loop will run 10,000 times, iterating once for each number in the sequence. Third, specify what you want to do in each iteration of the loop. The sample function tells the computer to take a random sample of 50 districts with replacement: the argument n equals 50, and the argument replace equals true. The append function adds a single item to an existing list. In this case, it appends each sample mean to the list. Your code generates a list of 10,000 values, each of which is the sample mean from a random sample. Next, create a new data frame for your list of 10,000 estimates. Name a new variable estimate underscore DF to store your data frame. Now, name a new variable mean underscore sample underscore means. Then compute the mean for your sampling distribution of 10,000 random samples. Display the value of your variable. The mean of your sampling distribution is about 73.41%. This is essentially identical to the population mean of your complete data set, which is about 73.4%. To visualize the relationship between your sampling distribution of 10,000 estimates and the normal distribution, we can plot both at the same time. For now, don't worry about the code, as it's beyond the scope of this course. I want to share three takeaways from this graph. First, as the central limit theorem predicts, the histogram of the sampling distribution is well approximated by the normal distribution. The outline of the histogram closely follows the normal curve. Second, the mean of the sampling distribution, the blue dotted line, overlaps with the population mean, the green solid line. This shows that the two means are essentially equal to each other. Third, the sample mean of your first estimate of 50 districts, the red dashed line, is farther away from the center. This is due to sampling variability. The central limit theorem shows that as you increase the sample size, your estimate becomes more accurate. For a large enough sample, the sample mean closely follows a normal distribution. Your first sample of 50 districts estimated the mean district literacy rate as 74.22%, which is relatively close to the population mean of 73.4%. To ensure your estimate will be useful to the government, you can compare the nation's literacy rate to other benchmarks, such as the global literacy rate or the literacy rate of peer nations. If the nation's literacy rate is below these benchmarks, this may help convince the government to devote more resources to improving literacy across the country. Estimating population parameters through sampling is a powerful form of statistical inference.
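Here's a minimal sketch of the resampling loop just described, continuing with the same assumed dataset and column names as in the earlier sketch:

```python
# Take 10,000 random samples of 50 districts and store each sample mean.
estimate_list = []
for i in range(10_000):
    estimate_list.append(
        education_df['OVERALL_LI'].sample(n=50, replace=True).mean()
    )

# Store the estimates in a data frame and compute their mean.
estimate_df = pd.DataFrame(data={'estimate': estimate_list})
mean_sample_means = estimate_df['estimate'].mean()
print(mean_sample_means)  # about 73.4, essentially the population mean
```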
When you're dealing with large numbers and complex calculations, Python helps you quickly make accurate estimates. You now have a solid foundation in sampling, which will serve you well in your future role as a data professional. In the data career space, you'll be working with sample data all the time. Throughout this part of the course, we've explored how data professionals use sample data to make inferences, predictions, and estimates about populations. Sampling is so useful because it's often too expensive, time-consuming, or complicated to collect data for an entire population. And sometimes a complete data set may be too large to process, even for a computer. Effective sampling is especially important in modern data analytics because data professionals often manage extremely large data sets. For example, you might work with economic data that has tens of millions of data points and need to use a sample of 10,000. As a working data professional, it's important to understand the sampling process used to generate your sample data and whether or not your sample is representative of your population. Plus, as you now know, different types of bias affect different sampling methods. Early on, we reviewed the different stages of the sampling process, from choosing a target population to collecting data for your sample. Then we discussed the two main types of sampling methods, probability sampling and non-probability sampling. We went over the benefits and drawbacks of each method and how random sampling can help ensure that your sample is high quality and representative of your population. We also discussed different forms of bias in sampling and how bias affects non-probability sampling methods. You learned that any insights you draw from biased data may not be accurate or useful to your stakeholders. After that, you learned about sampling distributions for both sample means and proportions and how to estimate the corresponding population parameters. We also covered the central limit theorem and how it helps you estimate the population mean for many different types of data sets. Finally, you learned how to use Python's SciPy stats module to work with sampling distributions and make a point estimate of a population mean. Coming up, you'll take a graded assessment. To prepare, check out the reading that lists all the new terms you've learned, and feel free to revisit videos, readings, and other resources that cover key concepts. Congratulations on your progress. Let's keep it going. Welcome back. Wow, you've come so far on your learning journey, and you've picked up a lot of new stats knowledge along the way. So far in this course, you've learned how data professionals use descriptive statistics to summarize and explore their data and how they use inferential statistics to draw conclusions about their data. You're familiar with basic rules of probability, like the addition and multiplication rules, and how they describe the likelihood of random events. You also know how probability distributions like the binomial, Poisson, and normal distributions can help you model different types of data. Recently, you also learned about the main stages of the sampling process and the benefits and drawbacks of using different sampling methods. Finally, you've learned how data professionals use sampling distributions to estimate means and proportions. In this part of the course, we'll explore how to construct and interpret a confidence interval.
A confidence interval is a range of values that describes the uncertainty surrounding an estimate. In stats and data science, there are different ways to describe the uncertainty of an estimate. Two of the main ways are confidence intervals and credible intervals. These concepts correspond to two different ways of thinking about statistics: frequentist and Bayesian. Confidence intervals are a frequentist concept. Credible intervals are a Bayesian concept. While the goal of confidence and credible intervals is similar, they have different statistical definitions and technical procedures. Right now, you don't need to worry about the details. I just want you to be aware of the broader context of different stats methods and the tools data professionals use to analyze and interpret data. Today, there's a lively debate among statisticians, researchers, and data professionals about how to apply and interpret confidence intervals. While the nuances of this debate are beyond the scope of this course, you may want to learn more about confidence intervals as you pursue your career in data analytics. Whether or not you join this ongoing conversation, it's important to know how to construct and interpret a confidence interval for at least two reasons. First, many data professionals use confidence intervals regularly as part of their job, and it may soon be a part of yours. Second, there's a good chance you may be asked about confidence intervals in a future job interview, so it's essential to have a foundation in the topic. Coming up, we'll discuss the importance of confidence intervals in data-driven work and how they can help you describe the uncertainty of an estimate. For example, data professionals might use a confidence interval to describe the uncertainty of an estimate for the average return on investment for a stock portfolio, the average maintenance costs for factory machinery, the percentage of customers who will register for a rewards program, or the percentage of website visitors who will click on an ad. However, confidence intervals are often misinterpreted, which can lead to false conclusions in a study, so you'll also learn how to correctly interpret confidence intervals and how to avoid common mistakes. We'll go over the procedure for constructing a confidence interval, from identifying a sample statistic and choosing a confidence level to finding the margin of error and calculating the interval. Then, you'll learn how to construct confidence intervals for both means and proportions. Finally, you'll learn how to use Python's SciPy stats module to construct a confidence interval for a point estimate of a population mean. When you're ready to learn more, I'll meet you in the next video. Earlier, we talked about how data professionals make point estimates about population parameters. For example, based on a sample of 100 penguins, a data professional might estimate that the mean weight of a population of 10,000 penguins is 31 pounds. Or, based on a poll of 100 voters, a data professional might estimate that 55% of all 100,000 voters prefer a certain candidate in an upcoming election. A point estimate uses a single value to estimate a population parameter. In contrast, an interval estimate uses a range of values to estimate a population parameter. A confidence interval is a type of interval estimate.
For example, for penguin weight, you might construct a 95% confidence interval between 28 and 32 pounds. Or, for the election poll, you might construct a 99% confidence interval between 51% and 57%. In this video, we'll go over the main components of a confidence interval and discuss how confidence intervals help you express the uncertainty of an estimate. Typically, data professionals use confidence intervals rather than point estimates to share their results. A point estimate can be useful, but a single value like 30 pounds does not express the uncertainty built into any estimate. This uncertainty is due to the method of random sampling. For the purpose of our example, let's imagine that the mean weight of all 10,000 penguins is 31 pounds, although you wouldn't know this unless you weighed every penguin. In practice, data professionals usually select one random sample because repeated random sampling is often expensive and time consuming. Since the sample is random, the mean of any given sample will likely not be equal to the actual population mean. For example, you may happen to weigh a sample of penguins that have recently struggled to find food, so their sample mean is only 28 pounds. Or you may weigh a sample of penguins that recently fed on a fish buffet and are above average at 32 pounds. Either way, your sample estimate will not equal the population mean of 31 pounds. So, if you only provide a sample statistic or point estimate, you won't convey how uncertain that estimate is. Confidence intervals give data professionals a way to express the uncertainty caused by randomness and provide a more reliable estimate. Along with the sample statistic, a confidence interval includes a margin of error and a confidence level. Let's explore our penguin example to get a better idea of each component. We'll start with our sample statistic. The sample mean of our group of penguins is 30 pounds. Next, we'll determine the interval for our estimate, which is defined by the sample statistic plus or minus the margin of error. The margin of error represents the maximum expected difference between a population parameter and a sample estimate. In other words, this is the amount that a data professional expects their estimate might vary from the actual value. So, if our sample statistic for our penguins is 30 pounds and our margin of error is plus or minus two pounds, the lower limit of the interval is 30 minus two, or 28 pounds, and the upper limit is 30 plus two, or 32 pounds. This range of values expresses the uncertainty in your estimate due to random sampling. Calculating the margin of error involves multiplying the standard error by a Z score. Remember, a Z score measures the distance of a data point from the population mean in a standard normal distribution. Typically, you'll use a computer for these calculations. Along with the sample statistic and margin of error, a confidence interval also includes a confidence level. The confidence level describes the likelihood that a particular sampling method will produce a confidence interval that includes the population parameter. For example, say you use a 95% confidence level to calculate a confidence interval between 28 and 32 pounds. Technically, this means that if you took 100 random samples from the penguin population and calculated a 95% confidence interval for each sample, then approximately 95 of the 100 intervals, or 95% of the total, would contain the actual population mean.
One such interval would be the range of values between 28 and 32 pounds. If this explanation seems rather abstract right now, don't worry. In a later video, we'll discuss confidence level in more detail. As a data professional, you can choose your own confidence level based on the desired accuracy of your estimate. Common confidence levels are 90%, 95%, and 99%. 95% is a popular choice. For example, most election polls report a 95% confidence level, and most A/B tests recommend using a confidence level of 95%. Note that there is nothing magical about 95%. This is a choice based on tradition in statistical research and education. You can adjust the confidence level to meet the requirements of your analysis. Let's explore another example. Imagine you're a data professional working for a fashion company. Your manager asks you to estimate sales revenue for the new line of spring clothing. When you meet with stakeholders, you might say, I think we'll do $1 million in sales. Or you might say, based on a 95% confidence level, I estimate that our sales revenue will be between $950,000 and $1,500,000. The first statement offers a point estimate. The second statement provides a confidence level and an interval estimate, and it communicates the uncertainty in the estimate. It gives your stakeholders more information and helps them make more informed decisions about issues related to future sales revenue. As a data professional, you also have to make sure your stakeholders understand your results. So it's your job to clearly communicate how to interpret a confidence interval. We'll discuss interpretation more later on. Recently, you learned that data professionals use confidence intervals to express the uncertainty in their results. To better understand your results and effectively communicate them to stakeholders, it's important to know how to properly interpret a confidence interval. Confidence intervals are one of the most misunderstood concepts in statistics. Because it's a complicated topic, both new students and experienced researchers sometimes make inaccurate statements about confidence intervals. So if you don't get the concept right away, don't worry, you're not alone. By the end of this video, you'll have a better understanding of how to interpret a confidence interval. You'll also learn some common forms of misinterpretation and how to avoid them. Let's explore an example. Imagine you're a data professional who works for an urban planning company in a large city. The city government asked your team to design new parks and walkways that feature red maple trees. For planning purposes, your manager asked you to estimate the mean height of all the red maple trees in the city. That's approximately 10,000 trees. Instead of measuring every single tree, you collect a sample of 50 trees. The mean height of the sample is 50 feet, with a standard deviation of 7.5 feet. Based on a 95% confidence level, you calculate a confidence interval for the mean height that stretches between 48 feet and 52 feet. This interval estimate will help your team design new parks and walkways that meet city ordinances for landscaping. At this point, you may be wondering, what does it mean to choose a 95% confidence level and to say that you are 95% confident in an interval estimate? Earlier, you learned that the confidence level expresses the uncertainty of the estimation process. Let's talk about what this means from a more technical perspective.
95% confidence means that if you take repeated random samples from a population and construct a confidence interval for each sample using the same method, you can expect that 95% of these intervals will capture the population mean. You can also expect that 5% of the total will not capture the population mean. In practice, data professionals usually select one random sample and generate one confidence interval, which may or may not contain the actual mean. This is because repeated random sampling is often difficult, expensive, and time consuming. Confidence intervals give data professionals a way to quantify the uncertainty due to random sampling. In our example, you have a 95% confidence interval for the mean height that stretches between 48 and 52 feet. For the purpose of this example, let's say the actual mean height of all 10,000 red maples is 51 feet. In practice, you would have no way of knowing this unless you measured every single tree in the city. This means that if you were to take 20 random samples of 50 trees and calculate a confidence interval for each sample, you can expect that 19 out of your 20 intervals, or 95% of the total, will capture the population mean of 51 feet. One such interval will be the range of values between 48 and 52 feet. Let's pause for a moment. I know that's a lot of new info to digest. Confidence intervals can be a bit tricky. That's why they're misinterpreted so often. To better understand what it means to say you have 95% confidence in your estimate, let's explore our urban planning example in more detail. Imagine you take another 20 random samples of 50 trees using the same sampling method. Because each sample is randomly selected from a large population, the mean will vary from one sample to the next. Remember, this is called sampling variability. For your first sample of 50 trees, the mean height is 50 feet. For your second sample of 50 trees, the mean height turns out to be 49.5 feet. For your third sample, you get a mean height of 51.5 feet, and so on. Because of sampling variability, the mean height for any given sample will not necessarily be equal to the actual population mean. Confidence intervals help express this uncertainty. The confidence intervals you calculate based on each sample mean will also vary from one sample to the next. And any given interval will not necessarily contain the population mean of 51 feet. For example, your first sample has a mean height of 50 feet and a confidence interval between 48 feet and 52 feet. This interval captures the population mean of 51 feet. Your second sample has a mean height of 49.5 feet and a confidence interval between 47.5 and 50.5 feet. This interval does not capture the population mean of 51 feet. However, a 95% confidence level means that you can expect that 19 out of your 20 intervals, or 95% of the total, will capture the population mean. In other words, this method will produce an interval that contains the population mean with a success rate of 95%. And that's a pretty good success rate.
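To make this success-rate idea concrete, here's a small simulation sketch. The simulated population of tree heights, the seed, and the assumption that heights are roughly normal are all illustrative choices, not measurements from the example.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

# Simulated population of 10,000 tree heights (assumed roughly normal).
population = rng.normal(loc=51, scale=7.5, size=10_000)
pop_mean = population.mean()

z = 1.96      # z-score for a 95% confidence level
captures = 0
for _ in range(20):
    sample = rng.choice(population, size=50, replace=True)
    se = sample.std(ddof=1) / np.sqrt(50)  # estimated standard error
    lower = sample.mean() - z * se
    upper = sample.mean() + z * se
    if lower <= pop_mean <= upper:
        captures += 1

print(captures)  # typically about 19 of the 20 intervals capture the mean
```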
Now that you have a better understanding of how to interpret a confidence interval, let's review three common misinterpretations of this concept. Being aware of these misinterpretations will help you avoid them in the future. The first common misinterpretation of confidence intervals is that a 95% confidence interval means that 95% of all the data values in your data set fall within the interval. This is not necessarily true. For example, your 95% confidence interval for tree heights is between 48 feet and 52 feet. It may not be accurate to say that 95% of all the values in your data set fall in this interval. It's possible that over 5% of the tree heights in your data set are outside this interval, either shorter than 48 feet or taller than 52 feet. The second common misinterpretation is that a 95% confidence interval implies that 95% of all possible sample means fall within the range of the interval. This is not necessarily true either. For example, your 95% confidence interval for tree height is between 48 feet and 52 feet. Imagine you take repeated samples using the same sampling method. It's possible that over 5% of your sample means will be less than 48 feet or greater than 52 feet. A third common misinterpretation is to assume that the margin of error in a confidence interval accounts for the only possible source of error in your results. While every confidence interval includes a margin of error, many other kinds of errors can enter into a statistical analysis. For example, the questions in a survey may be poorly designed, or sampling bias may affect the sample data. The margin of error is a useful measure of uncertainty and makes your estimate more reliable, but it's not the only possible source of error in your analysis. So when you're interpreting a confidence interval, remember that the uncertainty lies in an estimation process based on random sampling. A 95% confidence level refers to the success rate of that process. In other words, you can expect 95% of the random intervals you generate to capture the population parameter. Knowing how to properly interpret confidence intervals will give you a better understanding of your estimate and help you share useful and accurate information with stakeholders. You may need to explain the common misinterpretations, too, and why they're incorrect. You don't want your stakeholders to get the wrong idea or base their decisions on a misinterpretation. Understanding how to effectively communicate your results to stakeholders is an important part of being a data professional. Recently, you learned that data professionals use confidence intervals to describe the uncertainty of an estimate for a population mean or proportion. In this video, you'll learn how to construct a confidence interval for a proportion. We'll go step by step through an example involving election polling. Later on, we'll cover means. Imagine you are a data professional working for a polling agency. There is an upcoming election for governor between two candidates, Tiffany Davis and Maya Cruz. Your agency represents the Davis campaign. Election Day is four weeks away. The Davis team asks you to conduct a poll to find out how their candidate is doing. You collect a random sample of 100 voters from the total population of 100,000 voters. You ask them which candidate they plan on voting for. The results show that 55% of voters prefer Davis and 45% of voters prefer Cruz. The poll favors your candidate. If Davis gets over 50% of the vote on election day, it's a win, so 55% is a good result. Great news, right? But you also know that this is only one random sample of 100 voters out of a large population. If you took another random sample of 100 voters, you might get different results. If you took a third sample, the results might differ again, and so on. In other words, your single sample may not provide the actual population proportion, or percentage of all voters who will vote for Davis on election day. For example, on election day, Davis may get 52%, which is good enough to win, or 49%, which is not.
So instead of relying on a point estimate as proof that your candidate will win the election, you can use your sample data to construct a confidence interval. This will give the campaign team a better idea of the uncertainty of your estimate and of the possible election results. So let's construct a confidence interval now. Let's review the steps for constructing a confidence interval. First, identify a sample statistic. Second, choose a confidence level. Third, find the margin of error. And fourth, calculate the interval. First, identify your sample statistic. Your poll represents the percentage of voters who prefer your candidate, which is 55%. This is a sample proportion. Next, choose a confidence level for your poll. Most election polls report a 95% confidence level. The Davis campaign also requests that you use a 95% confidence level in your calculations. Your third step is to find the margin of error. The margin of error refers to the range of values above and below your sample statistic. If you're working with a normal distribution and a large sample size, one way to calculate the margin of error is by multiplying the Z score by the standard error. Let's break that down. To review, a Z score measures the distance of a data point from the population mean in a standard normal distribution. For example, a Z score of one is one standard deviation above the mean. A Z score of negative 1.5 is 1.5 standard deviations below the mean. This table shows the Z scores that correspond to popular confidence levels: 1.645 for 90%, 1.96 for 95%, and 2.58 for 99%. If you choose a 95% confidence level, use a Z score of 1.96 to calculate the margin of error. Now, you need to calculate your standard error. You may recall that the standard error measures the variability of your sample statistic. It shows how much your sample proportion is likely to differ from the actual population proportion. The larger the standard error, the more variability in your sample. The formula for the standard error of the proportion is the square root of the sample proportion times one minus the sample proportion, divided by the sample size. Your sample proportion is 0.55, and your sample size is 100. If you enter the numbers into the formula, you get a standard error of about 0.05. So let's put that all together. The margin of error is your Z score of 1.96, multiplied by your standard error of 0.05. This equals about 0.098. Finally, the last step in the process to construct a confidence interval is to calculate the interval. The upper limit of your interval is the sample proportion plus the margin of error, or 0.55 plus 0.098. This equals 0.648, or 64.8%. The lower limit is the sample proportion minus the margin of error, or 0.55 minus 0.098. This equals 0.452, or 45.2%. Therefore, you have a 95% confidence interval that stretches from 45.2% to 64.8%. While your confidence interval mostly lies above 50%, this isn't necessarily a reason to be optimistic about the upcoming election, since the lower limit of 45.2% falls below 50%. Based on the confidence interval, losing the election is still a possibility. The campaign team may want to invest more in TV or social media advertising to ensure victory. Or, if the campaign team wants a more accurate estimate of the election results, they may request another poll with a larger sample size. This will give a more accurate estimate because it includes more voters.
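Here's a short sketch that reproduces these calculations. The helper function proportion_ci is a hypothetical name used only for illustration:

```python
import numpy as np

def proportion_ci(p_hat, n, z=1.96):
    """Confidence interval for a proportion: p_hat plus or minus z times the standard error."""
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    margin_of_error = z * se
    return p_hat - margin_of_error, p_hat + margin_of_error

print(proportion_ci(0.55, 100))   # about (0.45, 0.65); with the rounded margin
                                  # of error used above, 45.2% to 64.8%
```

Calling the same function with a larger sample, proportion_ci(0.54, 1000), yields the narrower interval discussed next.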
Let's say you conduct another poll with a sample size of 1,000 voters. The new poll reports that 54% of voters prefer candidate Davis. If you calculate a 95% confidence interval using these numbers, your interval will stretch from 50.9% to 57.1%. The lower limit of your interval is now above 50%. This should give the Davis team more confidence about the upcoming election. Of course, there's still a chance their candidate may lose the election, since the confidence level is 95%, not 100%. You may notice that as the sample size gets larger, the confidence interval gets narrower. With the sample of 100, the interval covers 19.6 percentage points. With the sample size of 1,000, the interval covers 6.2 percentage points. This is because as your sample size increases, your margin of error decreases. If you could sample every member of the population, the margin of error would be zero. But of course, it's often too expensive and time consuming to sample an entire population or to take repeated samples. Data professionals typically work with a single random sample of a large population. Confidence intervals help data professionals give more reliable estimates based on the available data. And based on your data, your candidate will likely win the election. Previously, you learned that data professionals use confidence intervals to express the uncertainty of an estimate. Then you constructed a confidence interval for the proportion of votes in an upcoming election. In this video, you'll construct another confidence interval, but this time for a mean. The basic process is the same as the one you used for a proportion, but it requires new calculations. We'll go step by step through an example involving the marketing of a new cell phone. Imagine you're a data professional working for a company that produces cell phones. Recently, the company developed a phone with an extended battery life. It's designed to operate for at least 20 hours without recharging. This is a big upgrade in battery life and will boost sales. The marketing team is planning an ad campaign about the new battery to help sell the phone. Management wants to make sure the claim about 20 hours of battery life is accurate before the ads go public. They ask you to analyze the data and make a reliable estimate for the battery life of the new phone. The company has produced 100,000 new phones. The product engineering team tests a random sample of 100 phones and records the data about battery life. Based on the data, you know that the sample mean duration for battery life is 20.5 hours and the sample standard deviation is 1.7 hours. And based on the company's data about the standard manufacturing process of the batteries, you also know that the population standard deviation is 1.5 hours. The sample mean is over 20 hours for this test. However, you know that this is only one random sample of 100 phones out of a large population. If you took another random sample of 100 phones, you might get different results. If you took a third random sample, the results might differ again, and so on. Your single sample may not provide the actual mean battery life for all the phones. The population mean for battery life could be 19 hours, 21 hours, or something else. You can use your sample data to construct a confidence interval that likely includes the population mean for the phone's battery life. This will give the marketing team a better idea of the uncertainty in your estimate. It will also help them decide how to advertise the phone and whether they can claim that its battery lasts 20 hours or more.
Let's review the steps for constructing a confidence interval. First, identify a sample statistic. Second, choose a confidence level. Third, find the margin of error. And fourth, calculate the interval. First, identify your sample statistic. Your sample represents the average duration of battery life for 100 cell phones. In this example, you're working with the sample mean. Next, choose a confidence level. Management requests that you choose a 95% confidence level. This is the company standard for new products. Your third step is to find the margin of error, which refers to the range of values above and below your sample statistic. You can calculate the margin of error by multiplying the Z score by the standard error. You may recall that the Z score you use depends on your confidence level. This table shows the Z scores that correspond to popular confidence levels, such as 90%, 95%, and 99%. The Z score for a 95% confidence level is 1.96. Now you can calculate the standard error, which measures the variability of your sample statistic. It shows how much your sample mean is likely to differ from the actual population mean. The larger the standard error, the more variability. The formula for the standard error of the mean is the population standard deviation divided by the square root of the sample size. Your population standard deviation is 1.5 and your sample size is 100. If you enter the numbers into the formula, you get a standard error of 0.15. The margin of error is your Z score of 1.96 multiplied by your standard error of 0.15. This equals 0.294. Finally, calculate your confidence interval. The upper limit of your interval is the sample mean plus the margin of error. This is 20.5 plus 0.294, which equals 20.794 hours, or about 20 hours and 48 minutes. The lower limit is the sample mean minus the margin of error. This is 20.5 minus 0.294, which equals 20.206 hours, or about 20 hours and 12 minutes. So you have a 95% confidence interval for the battery life of the phone that stretches from 20 hours and 12 minutes to 20 hours and 48 minutes. The confidence interval gives company management important information. The lower limit of your interval, 20 hours and 12 minutes, is above the company's goal of 20 hours. This helps the marketing team feel confident about advertising the battery life of the cell phone as at least 20 hours. You present your findings to the company's stakeholders, and the results satisfy everyone except the head of marketing. The marketing director has put a lot of time and effort into developing the ad campaign and wants to be even more confident. The director requests that you analyze the data using a 99% confidence level. To make the marketing director happy, you recalculate your results using the same sample data but with a 99% confidence level instead of 95%. Your confidence interval now stretches from 20 hours and seven minutes to 20 hours and 53 minutes. The lower limit of your interval is still above 20 hours. This result should give company management even more confidence about the battery life and hopefully satisfy the marketing director. You may notice that as the confidence level gets higher, the confidence interval gets wider. With the confidence level of 95%, the interval covers 36 minutes. With the confidence level of 99%, the interval covers 46 minutes. This is because a wider confidence interval is more likely to include the actual population parameter.
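Here's a quick sketch of the arithmetic for both confidence levels, using the z-scores from the table above:

```python
import numpy as np

sample_mean = 20.5  # hours
pop_sd = 1.5        # known population standard deviation, in hours
n = 100

standard_error = pop_sd / np.sqrt(n)  # 0.15

for z, level in [(1.96, '95%'), (2.58, '99%')]:
    margin_of_error = z * standard_error
    print(level, sample_mean - margin_of_error, sample_mean + margin_of_error)

# 95%: about 20.21 to 20.79 hours (20 hours 12 minutes to 20 hours 48 minutes)
# 99%: about 20.11 to 20.89 hours (20 hours 7 minutes to 20 hours 53 minutes)
```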
Note that in this example, we know that the population standard deviation is 1.5 hours. However, in practice, the population standard deviation is often unknown and has to be estimated based on the sample standard deviation. This is because it's difficult to get complete data on a large population. If you don't know the population standard deviation, this changes the calculations for the confidence interval; in that case, you typically use the sample standard deviation and a t-distribution instead of the normal distribution. To learn more, feel free to check out the relevant reading. As a data professional, you can use confidence intervals to help stakeholders make informed decisions based on accurate estimates. Your analysis of the data will help shape the company's strategy for the new product launch. As a data professional, you play a key role in the future success of the new product. Earlier, we talked about how data professionals use sample data to make point estimates about population parameters. For example, a data professional might take a random sample of 100 home prices in Mexico City to estimate the average price of all homes in Mexico City. A point estimate can provide a general idea of a population parameter, but estimates usually include some error due to sampling variability. And in practice, taking repeated samples to get improved estimates is often too expensive and time consuming. So data professionals use confidence intervals to describe the uncertainty of an estimate and give stakeholders more information to work with. In this video, you'll use Python to construct a confidence interval for a point estimate. We'll continue with our previous scenario in which you're a data professional working for the Department of Education of a large nation. Recall that you're analyzing data on the literacy rate for each district. You'll continue to use the dataset you worked with before. If you need to access the data, do so now. In a previous video, we imagined that the Department of Education asked you to collect the data on district literacy rates. You were only able to survey 50 randomly chosen districts instead of all 634 districts included in your original dataset. You used Python to simulate taking a random sample of 50 districts and making a point estimate of the population mean, or the mean literacy rate for all districts. Now, as the next step, imagine that the Department asks you to construct a 95% confidence interval for your estimate of the mean district literacy rate. You can use Python to construct the confidence interval. Let's open up a Jupyter Notebook and begin. First, import the Python packages you plan to use, NumPy and Pandas. To save time, rename your packages with the abbreviations np and pd. To load the SciPy stats module, write from scipy import stats. You can also use the same sample data that you worked with in a previous video. Write the code to have Python simulate the same random sample of district literacy rate data. First, name your variable sampled underscore data. Then, enter the arguments of the sample function: n, the sample size, equals 50, and replace equals true because you are sampling with replacement. For random underscore state, choose the same random number to generate the same results. Previously, you used 31,208. Now, display the value of your variable. The output shows 50 districts selected randomly from your dataset. Each has a different literacy rate. In previous videos, you constructed a confidence interval step by step. Let's review the four main steps. One, identify a sample statistic. Two, choose a confidence level. Three, find the margin of error. Four, calculate the interval.
Earlier, you worked through these steps one by one to construct a confidence interval. With Python, you can construct a confidence interval with just a single line of code and get your results faster. If you're working with a large sample size, say larger than 30, you can construct a confidence interval for the mean using SciPy's stats.norm.interval function. This function includes the following arguments: alpha, which refers to the confidence level; loc, which refers to the sample mean; and scale, which refers to the sample standard error. Let's explore each argument in more detail. First, alpha, or your confidence level. The education department requests a confidence level of 95%, which is the accepted standard for government-funded research. Second, loc, or the sample mean. This is the mean literacy rate of your sample of 50 districts. Name a new variable sample underscore mean. Then, compute the mean district literacy rate for your sample data. Third, scale, or the sample standard error. Recall that standard error measures the variability of your sample data. You may remember that the formula for the sample standard error is the sample standard deviation divided by the square root of the sample size. You can write code to express the formula and have Python do the calculation for you. First, name a new variable estimated underscore standard underscore error. Next, take the standard deviation of your sample data and divide by the square root of your sample size. Then, in parentheses, write the name of your data frame followed by the shape function and zero in brackets. Recall that the shape function returns the number of rows and columns in a data frame. Shape zero returns only the number of rows, which is the same number as your sample size. Now, you're ready to put all this together to construct your confidence interval with the stats.norm.interval function. First, write out the function and set the arguments. For alpha, set 0.95 because you want to use a 95% confidence level. For loc, enter the variable sample underscore mean. And for scale, enter the variable estimated underscore standard underscore error. Then, run the code. And out comes your confidence interval. Python makes this process super efficient. You have a 95% confidence interval for the mean district literacy rate that stretches from about 71.4% to 77.0%. The Department of Education will use your estimate of the mean district literacy rate to help make decisions about distributing funds to different states. Now, imagine that a senior director in the department wants to be even more confident about your results. The director wants to make sure you have a reliable estimate and suggests that you recalculate your interval with a 99% confidence level. To choose a new confidence level, copy and paste your previous code. Change your alpha from 0.95 to 0.99 to compute a 99% confidence interval based on the sample data. Now, run the code. And here's your confidence interval. You have a 99% confidence interval for the mean district literacy rate that stretches from about 70.5% to 77.9%. You may notice that as the confidence level gets higher, the confidence interval gets wider. With a confidence level of 95%, the interval covers 5.6 percentage points. With a confidence level of 99%, the interval covers about 7.4 percentage points. This is because a wider confidence interval is more likely to include the actual population parameter.
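Here's a consolidated sketch of the lines of code described above. It assumes sampled_data is the pandas Series of 50 sampled literacy rates from the earlier sampling step:

```python
import numpy as np
from scipy import stats

# sampled_data comes from the earlier step:
# sampled_data = education_df['OVERALL_LI'].sample(n=50, replace=True, random_state=31208)
sample_mean = sampled_data.mean()

# Sample standard error: sample standard deviation divided by the
# square root of the sample size (shape[0] returns the number of rows).
estimated_standard_error = sampled_data.std() / np.sqrt(sampled_data.shape[0])

# The first argument is the confidence level. Older SciPy versions name
# this parameter alpha, newer versions name it confidence, so it's
# passed positionally here to work with either.
print(stats.norm.interval(0.95, loc=sample_mean, scale=estimated_standard_error))
# roughly (71.4, 77.0)

print(stats.norm.interval(0.99, loc=sample_mean, scale=estimated_standard_error))
# roughly (70.5, 77.9)
```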
In our scenario for this video, you only have data on 50 districts. However, in earlier videos, you computed the mean literacy rate for all 634 districts in your data set as about 73.4%. So, as it turns out, both your confidence intervals capture the actual population mean. Your results will help the Department of Education decide how to distribute government resources to improve literacy. This is the end of your introduction to confidence intervals. You've come a long way since the beginning of the course. Congrats on all your progress. We've talked often about how data professionals use sample statistics to make estimates about population parameters. In this part of the course, we estimated the proportion of voters who prefer a candidate and the mean battery life of a cell phone. Confidence intervals help express the uncertainty in an estimate and share a range of possible outcomes. For instance, I can say the marketing campaign will generate $200,000 in new sales. Or I can say that, based on a 95% confidence level, I estimate the marketing campaign will generate between $150,000 and $250,000 in new sales revenue. Both predictions might be reasonable, but the confidence interval expresses the uncertainty in the estimate and gives a stakeholder more information to work with. And having more information helps them make better decisions. Sharing reliable estimates with stakeholders has a positive impact on an organization. For example, imagine you're a data professional working with a shipping company that transports products around the world. You can use confidence intervals to help estimate economic factors, such as fuel prices, shipping costs, local tariffs, and more. This information helps company leaders minimize risk, avoid unnecessary expenses, and increase efficiency. And improving the speed and security of shipping will benefit the thousands of people who rely on your company's services. In this part of the course, we discussed the role of confidence intervals in data analytics, and we reviewed the basic steps for constructing a confidence interval. Next, you learned how to interpret a confidence interval and how to avoid common misinterpretations of your results. We listed the steps for constructing a confidence interval, from identifying a sample statistic and choosing a confidence level to finding the margin of error and calculating the interval. Then, you learned how to construct confidence intervals for both means and proportions. Finally, you used Python's SciPy stats module to construct a confidence interval for a point estimate of a population mean. Soon, you'll take a graded assessment. To prepare, check out the reading that lists all the new terms you've learned. And feel free to revisit videos, readings, and other resources that cover key concepts. You're doing a great job. Keep it up. Hey there, future data professional. You've come a long way since the beginning of your learning journey. Way to go. Just think of all the new skills you've developed. You can calculate descriptive stats like the mean and standard deviation to summarize your data. You can use probability distributions like the binomial, Poisson, and normal distributions to model different types of data. You can work with sampling distributions to estimate population means and proportions. And you can construct confidence intervals to help describe the uncertainty of an estimate. Now, you're going to add a new skill to your skill set: hypothesis testing. Hypothesis testing is a statistical procedure that uses sample data to evaluate an assumption about a population parameter.
For example, hypothesis tests are often used in clinical trials to determine whether a new medicine leads to better outcomes in patients. Imagine a pharmaceutical company invents a medicine to treat the common cold. The company tests a random sample of 200 people with cold symptoms. Without medicine, the typical person experiences cold symptoms for 7.5 days. The average recovery time for people who take the medicine is 6.2 days. The company might then ask, are the results of the clinical trial statistically significant? Recall that statistical significance is the claim that the results of a test or experiment are not explainable by chance alone. In other words, did the drug actually have a positive impact on recovery time? Or are the results due to chance or sampling variability? To answer these questions, the company may ask a data professional to conduct a hypothesis test. The test helps quantify whether the result is likely due to chance or if it's statistically significant. This knowledge will help the company determine if the drug is truly effective and if it should be approved for public use. Coming up, we'll go over the general procedure for hypothesis testing, from stating the null hypothesis and alternative hypothesis to choosing a significance level, finding the p-value, and rejecting or failing to reject your null hypothesis. Then we'll explore two different types of hypothesis tests: one sample and two sample. Finally, you'll learn how to use Python's SciPy stats module to conduct a two sample hypothesis test to compare two population means. When you're ready to learn more, I'll meet you in the next video. Hi, my name is Elia, and I'm a data science intern at Google. I am originally from France. I grew up there and did most of my studies there, and I have always been very interested in math. During the beginning of my college education, I also discovered computer science and started building skills in computer science. For my first internship in France, I was looking for ways to apply these two fields to a real-life question, and I came upon data science internships. This made me want to continue my career in data science and specialize in data science. I've always been looking at the different applications of data science and AI in healthcare, and this is why I was very interested in working at Google and specifically at Verily. What excites me the most is being able to work on data science related to healthcare and generate insights and build models that can actually have an impact on either patients or the healthcare industry as a whole. A data science intern at Google is given a project to work on during the duration of the internship, and I specifically worked on clinical natural language processing, which is basically the ability to extract relevant information from clinical notes using machine learning methods and to generate insights from them. I definitely think my internship has prepared me so much for a career in data science. As a data scientist, you're always learning new things. There are always new state-of-the-art models coming out, and it's always really interesting to keep up to date with them and to really learn how they work so that you're able to tailor them to your use case. One of the parts of the project was to extract social determinants of health from patient records and notes. Then I explored the relationships between the social determinants of health, and some really interesting associations came up.
Some of these associations were already known, but it was really cool to be able to observe them in my own work. I think what's so great about data science and being a data scientist is that data science can have an impact on any field. So if you're interested in something that has nothing to do with data science, I'm pretty sure you can find a job that relates to it. Look at the industries you're interested in, and most likely there's going to be a data science job there. Recently, you learned that a hypothesis test uses sample data to evaluate an assumption about a population parameter. For example, for a clinical trial of a new medicine, a hypothesis test can help you determine if the effect of the medicine on the average recovery time of your sample group is statistically significant or due to chance. In this video, we'll go over the procedure for performing a hypothesis test, and you'll learn about the main concepts involved in hypothesis testing. Let's outline the steps for performing a hypothesis test. First, state the null hypothesis and the alternative hypothesis. Second, choose a significance level. Third, find the p-value. And fourth, reject or fail to reject the null hypothesis. Right now, these concepts may seem a bit abstract. That's okay. You'll learn more about each one by the end of this video. For now, to clarify the steps involved in hypothesis testing, let's explore an example. Afterwards, we'll revisit the concepts. Imagine you are given a coin to use in a game. You're not sure if the coin is fair or rigged. That is, you don't know if it's a standard coin or if it's been specifically weighted to affect the outcome of a toss, for example, to always land on tails. Before using it in the game, you want to find out if the coin is fair or not. You decide to test the coin by tossing it six times in a row and recording the outcomes. As you may recall from our earlier discussion of probability, if the coin is fair, the chance of landing on heads or tails is 0.5 or 50% for any given toss. If the coin is rigged for tails, the chance of landing on tails for any given toss will be much higher, perhaps 90 or even 100%. Before you test it out, you need a benchmark to evaluate the results of the test. For example, let's say the first two tosses land on tails. Is the coin rigged? Recall that we use the multiplication rule to calculate the probability of independent events. So the probability of the coin landing on tails two times in a row is 0.5 times 0.5, which is 0.25 or 25%. That's not unlikely. At this point, you can't reasonably conclude that the coin is rigged. Now imagine the coin lands on tails four times in a row. The probability of this occurring is 0.5 to the fourth power, which is 0.0625 or 6.25%. That's unlikely, but not impossible. However, you want to feel even more confident that the outcome is not due to chance. You decide to use 5% as a threshold to determine if the outcome is due to chance. In other words, if the probability of the outcome is less than 5%, under the assumption that the coin is fair, you'll conclude that the coin is actually rigged. For example, the probability of a fair coin landing on tails six times in a row is 0.5 to the sixth power, which is about 0.0156 or 1.56%. This is too unlikely to attribute to chance, because the probability is less than your threshold of 5%. If this occurs, you'll conclude the coin is rigged. Now you're ready to proceed with your test. You toss the coin six times in a row and record the results.
The coin lands on tails every time. You conclude that the coin is rigged. Unfortunately, the coin won't be of much use to you, unless you're performing a magic trick and happen to need a coin that always lands on tails. This example is a simplified version of a hypothesis test. To test whether or not the coin is fair, you went through each step of the hypothesis testing procedure. Let's review the steps for conducting a hypothesis test. First, state the null hypothesis and the alternative hypothesis. Second, choose a significance level. Third, find the p-value. Fourth, reject or fail to reject the null hypothesis. Now let's explore each step in more detail. First, state your null hypothesis and your alternative hypothesis. The null hypothesis is a statement that is assumed to be true unless there's convincing evidence to the contrary. The null hypothesis typically assumes that your observed data occurs by chance. The alternative hypothesis is a statement that contradicts the null hypothesis and is accepted as true only if there's convincing evidence for it. The alternative hypothesis typically assumes that your observed data does not occur by chance. In our example, your null hypothesis states that the coin is fair. Having a fair coin is the standard or typical state of things. The null hypothesis states that your observations result purely from chance. Your alternative hypothesis states the contrary claim: the coin is not fair. The alternative hypothesis says that the outcome was the result of rigging and did not happen by chance. Next, choose your significance level. This is the threshold at which you will consider results statistically significant. The significance level is also the probability of rejecting the null hypothesis when it is true. I'll talk more about this later in the video. In our example, you used 5% as the threshold to determine if the outcome of the coin toss occurred by chance. Typically, data professionals set the significance level at 5%. Note that there's nothing magical about 5%. This is a choice based on tradition in statistical research and education. You can adjust the significance level to meet the requirements of your analysis. Other common choices are 1% and 10%. Next, find your p-value. The p-value refers to the probability of observing results as or more extreme than those observed when the null hypothesis is true. We already calculated that the probability of a fair coin landing on tails six times in a row is 1.56%. So if you assume the null hypothesis is true and the coin is fair, then the p-value in our example is 1.56%. Anything lower than that would mean that there is stronger evidence for the alternative hypothesis. Remember, your alternative hypothesis is that the coin is not fair. For instance, the probability of tossing seven tails in a row with a fair coin is about 0.78%, which is lower than the p-value of 1.56%. If you tossed seven tails in a row, you'd have even stronger evidence for the alternative hypothesis that the coin is not fair. Finally, you have to decide whether to reject or fail to reject the null hypothesis. Statisticians always say fail to reject rather than accept. This is because hypothesis tests are based on probability, not certainty, and acceptance implies certainty. In general, as data professionals, we try not to claim certainty about results based on statistical methods. There are two main rules for drawing a conclusion about a hypothesis test. If your p-value is less than your significance level, you reject the null hypothesis.
If your p-value is greater than your significance level, you fail to reject the null hypothesis. In the example of the coin toss, your p-value of 1.56% is less than your significance level of 5%. So you reject the null hypothesis and conclude that your result of six consecutive tails is statistically significant and not due to chance. Your decision to reject or fail to reject also depends on your significance level. Let's say before your test, you chose a significance level of 1% instead of 5%. In that case, you would fail to reject the null hypothesis, because your p-value of 1.56% would be greater than your significance level of 1%. A statistically significant result cannot prove with 100% certainty that a hypothesis is correct. Because hypothesis testing is based on probability, there's always a chance of drawing the wrong conclusion about the null hypothesis. In hypothesis testing, there are two types of errors you can make when drawing a conclusion: a type one error and a type two error. A type one error, also known as a false positive, occurs when you reject a null hypothesis that is actually true. In other words, you conclude that your result is statistically significant when in fact it occurred by chance. In our example, concluding that the coin is rigged when it's actually fair would be considered a type one error. Even though you got six tails in a row, this outcome could still be due to chance. It's highly unlikely, but it's possible. Earlier in the video, I mentioned that your significance level is also the probability of rejecting the null hypothesis when it is true. A significance level of 5% means you're willing to accept a 5% chance you are wrong when you reject the null hypothesis. To reduce your chance of making a type one error, choose a lower significance level. Recall that if you choose a 1% significance level, you fail to reject the null and conclude that the coin is fair. However, choosing a lower significance level means you're more likely to make a type two error, or false negative. This occurs when you fail to reject a null hypothesis which is actually false. In other words, you conclude your result occurred by chance when it's in fact statistically significant. In our example, you would conclude that the coin is fair when it's actually rigged. As a data professional, it helps to be aware of the potential errors built into hypothesis testing and how they can affect your results. Depending on the situation and the goal of your analysis, you may want to minimize the risk of either a type one or type two error. Imagine you're testing the strength of a fabric for a parachute manufacturer. You want to be very confident that the material you're using is strong enough for a functional parachute. A type one error, or false positive, means you falsely identify the material as strong enough. Obviously, in this case, you want to minimize the risk of a type one error. To do so, choose a significance level of 1% instead of the standard 5%. This change decreases the chance of a type one error, or false positive, from 5% to 1%. Ultimately, it's your responsibility as a data professional to determine how much evidence you need to decide that a result is statistically significant. How risky is a type one error, or a false positive? There's no single correct answer for all situations. It's up to you to decide. That's all for now. The coin toss example shows you the main concepts involved in conducting a hypothesis test.
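To make the coin example concrete, here's a minimal Python sketch of the same reasoning. The variable names are just for illustration; the probabilities match the ones we calculated above.

```python
# A minimal sketch of the coin-toss reasoning.
# With a fair coin, the probability of k tails in a row is 0.5 ** k
# (the multiplication rule for independent events).

significance_level = 0.05  # the 5% threshold chosen before the test

for k in (2, 4, 6, 7):
    p = 0.5 ** k
    if p < significance_level:
        verdict = "reject the null: the coin looks rigged"
    else:
        verdict = "fail to reject the null: could be chance"
    print(f"{k} tails in a row: probability = {p:.4f} -> {verdict}")
```

Running this prints 0.25 for two tails, 0.0625 for four, 0.0156 for six, and 0.0078 for seven, so only the six- and seven-tail outcomes fall below the 5% threshold.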
As a data professional, you'll use these concepts for any hypothesis test that you may want to conduct. Earlier, I mentioned that there are two different types of hypothesis tests: one sample and two sample. A one sample test determines whether or not a population parameter, like a mean or proportion, is equal to a specific value. A two sample test determines whether or not two population parameters, such as two means or two proportions, are equal to each other. We'll explore two sample tests later on. A data professional might conduct a one sample hypothesis test to determine if a company's average sales revenue is equal to a target value, a medical treatment's average rate of success is equal to a set goal, or a stock portfolio's average rate of return is equal to a market benchmark. In this video, you'll conduct a one sample hypothesis test. We'll explore an example involving data for an online delivery company. There are different types of hypothesis tests you can use based on the type of sample data you're working with. In this course, we focus on Z tests and T tests, two of the most commonly used tests in data analytics. If these terms seem familiar, it's because they're related to Z scores and T scores, which we've worked with before. The one sample Z test makes the following assumptions: the data is a random sample of a normally distributed population, and the population standard deviation is known. Imagine you're a data professional who works for an online delivery company. Typically, the mean delivery time for an online food order is 40 minutes with a standard deviation of five minutes. But recently, company management launched a new training program to make the delivery process more efficient. After delivery drivers completed the training program, management tracked a random sample of 50 deliveries to understand how long a delivery takes. The sample of 50 deliveries had a mean delivery time of 38 minutes with a standard deviation of five minutes. There is an observed difference of two minutes between the population mean of 40 minutes and the sample mean of 38 minutes. The management team asked you to determine if the decrease in average delivery time is statistically significant or if it's due to chance. If the decrease is statistically significant, the company wants to invest in developing and implementing the training program in other regions. You decide to conduct a one-sample Z test to analyze the data. Let's review the steps for conducting a hypothesis test. First, state the null hypothesis and the alternative hypothesis. Second, choose a significance level. Third, find the p-value. Fourth, reject or fail to reject the null hypothesis. Start by stating your null hypothesis and alternative hypothesis. Recall that the null hypothesis is a statement that is assumed to be true unless there is convincing evidence to the contrary. In a one-sample Z test, the null hypothesis states that the population mean is equal to an observed value. In this case, your null hypothesis says that the average delivery time equals 40 minutes. 40 minutes is the standard average delivery time. The alternative hypothesis is a statement that contradicts the null hypothesis. In a one-sample test, there are three main options for the alternative hypothesis: the population mean is not equal to, less than, or greater than an observed value. In this case, you want to test whether the training has decreased the average delivery time. Your alternative hypothesis says the average delivery time is less than 40 minutes.
Next, set the significance level, or the threshold at which you will consider a result statistically significant. This is also the probability of rejecting the null hypothesis when it is true. You choose a significance level of 5%, which is the company's standard for data analysis. Now, find the p-value. Recall that your p-value is the probability of observing a difference in your results as or more extreme than the difference observed when the null hypothesis is true. Typically, the mean delivery time is 40 minutes. The mean delivery time of your sample is 38 minutes. Your null hypothesis claims that this two-minute difference in delivery time is due to chance or sampling variability. Your p-value is the probability of observing a difference of two minutes or greater if the null hypothesis is true. If the probability of this outcome is very unlikely, in particular, if your p-value is less than your significance level of 5%, then you will reject the null hypothesis. As a data professional, you'll almost always calculate the p-value on your computer, using a programming language like Python or other statistical software. However, let's briefly explore the concepts involved in the calculation to get a better understanding of how it works. Being able to use code for calculations is important for your future career, but being familiar with the concepts behind the calculations will help you apply statistical methods to workplace problems. The p-value is calculated from what's called a test statistic. In hypothesis testing, the test statistic is a value that shows how closely your observed data matches the distribution expected under the null hypothesis. So, if you assume the null hypothesis is true and the mean delivery time is 40 minutes, the data for delivery times follows a normal distribution. The test statistic shows where your observed data, a sample mean delivery time of 38 minutes, falls on that distribution. Since you're conducting a Z test, your test statistic is a Z score. Recall that a Z score is a measure of how many standard deviations below or above the population mean a data point is. Z scores tell you where your values lie on a normal distribution. The following formula gives you the test statistic Z based on your sample data: Z equals X bar minus mu, divided by sigma over the square root of n, where X bar is the sample mean, mu is the population mean, sigma is the population standard deviation, and n is the sample size. If you enter the numbers in the formula and do the calculation, you get a Z score of about negative 2.83. Let's check out where this Z score lies on the distribution. It's far to the left, almost three standard deviations below the mean. For a normal distribution, the probability of getting a value less than your Z score of negative 2.83 is calculated by taking the area under the curve to the left of the Z score. This is called a left-tailed test, because your p-value is located on the left tail of the distribution. The area under this part of the curve is the same as your p-value. Again, your p-value is the probability of observing a test statistic as or more extreme than that observed when the null hypothesis is true. Your alternative hypothesis states that the mean delivery time decreased based on your sample data. That's why we're interested in the probability of getting any value as low or lower than your Z score of negative 2.83.
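If you'd like to check the arithmetic yourself, here's a short Python sketch of the same calculation, using SciPy's normal distribution for the left-tailed p-value. The variable names are illustrative.

```python
# A sketch of the one-sample z-test for the delivery example.
from math import sqrt

from scipy import stats

x_bar = 38   # sample mean delivery time, in minutes
mu = 40      # population mean under the null hypothesis
sigma = 5    # population standard deviation (known, per the z-test assumptions)
n = 50       # sample size

z = (x_bar - mu) / (sigma / sqrt(n))

# Left-tailed test: the p-value is the area under the normal curve
# to the left of the z-score
p_value = stats.norm.cdf(z)

print(f"z = {z:.4f}, p-value = {p_value:.4f}")
# z = -2.8284, p-value = 0.0023
```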
In a different testing scenario, your test statistic might be positive 2.45 and you might be interested in values as high or higher than the Z score 2.45. In that case, your p-value would be located on the right tail of the distribution and you'd be conducting a right-tailed test. If you calculate the p-value, you'll find that it's 0.0023. So your p-value is 0.0023 or 0.23%. This means there is a 0.23% probability that the difference in mean delivery time would be two minutes or greater if the null hypothesis is true. In other words, it's highly unlikely that the difference is due to chance. To draw a conclusion about your null hypothesis, compare your p-value with the significance level. If your p-value is less than the significance level, you conclude that there is a statistically significant difference in mean delivery time. In other words, you reject the null hypothesis. If your p-value is greater than the significance level, you conclude that there is not a statistically significant difference in mean delivery time. In other words, you fail to reject the null hypothesis. Your p-value of 0.0023 or 0.23% is less than the significance level of 0.05 or 5%. So you reject the null hypothesis and conclude that there is a statistically significant difference in mean delivery time. Your results suggest that the faster delivery time is likely due to the positive effects of the training. Your analysis will help company leadership decide whether to make a bigger investment in the training program going forward. Based on your results, they're likely to do so. Earlier, you conducted a one sample hypothesis test to analyze data on the mean delivery time for an online food delivery service. Coming up, we'll explore a two sample test for means. While a one sample test determines whether a population mean is equal to a specific value, a two sample test determines whether two population means are equal to each other. In data analytics, two sample tests are frequently used for AB testing. In my career as a data professional, I've conducted a number of AB tests to help companies improve their online business. Typically, I use a two sample T test to analyze the data. For example, let's say an online retail store is considering changing the landing page for its reward club members, who are the most loyal customers. The metric that matters most to the company is the average time users spend on the landing page per session. First, I'd set up an experiment for two groups of users. Group A uses the default landing page and Group B uses a redesigned version of the landing page. Then I'd use a T test to compare the average time spent on each landing page and determine if the difference between the two sample means is statistically significant. In other words, if Group B spends more time on the landing page than Group A, the T test will help determine if that's due to chance or to the new design of the landing page. In this video, you'll conduct a two sample T test to compare two population means. We'll work through an example based on the scenario I just shared with you. In data analytics, the two sample T test is the standard approach for comparing two means. The two sample T test for means makes the following assumptions. The two samples are independent of each other. For each sample, the data is drawn randomly from a normally distributed population. The population standard deviation is unknown. 
Typically, data professionals use a Z test when the population standard deviation is known and use a T test when the population standard deviation is unknown and needs to be estimated from the data. In practice, the population standard deviation is usually unknown because it's difficult to get complete data on large populations. So data professionals use the T test for practical applications. While the test statistic for a Z test is a Z score, the test statistic for a T test is a T score. And while Z scores are based on the standard normal distribution, T scores are based on the T distribution. The graph of the T distribution has a bell shape that is similar to the standard normal distribution. But the T distribution has bigger tails than the standard normal distribution does. The bigger tails indicate the higher frequency of outliers that come with a small data set. As the sample size increases, the T distribution approaches the normal distribution. For a T test, the test statistic follows a T distribution under the null hypothesis. Let's explore how to conduct a two sample T test. Imagine you're a data professional who works for a cosmetics company. The company is researching the amount of time customers spend on its website. Your team leader asks you to conduct an AB test to determine if changing the background color of the landing page from gray to green has any effect on the average time spent on the page. You randomly select two groups of users. The first group visits the gray landing page named version A. The second group visits the green landing page named version B. You collect the following data from the AB test. 40 users visit version A. They spend a mean time of 300 seconds with a standard deviation of 18.5 seconds. 38 users visit version B. They spend a mean time of 305 seconds with a standard deviation of 16.7 seconds. There's an observed difference of five seconds or 305 minus 300 between the mean time spent on version B and version A. You decide to conduct a two sample T test to analyze the data. Let's review the steps for conducting a hypothesis test. First, state the null hypothesis and the alternative hypothesis. Second, choose a significance level. Third, find the P value. Fourth, reject or fail to reject the null hypothesis. First, state your null hypothesis and alternative hypothesis. In a two sample T test, the null hypothesis states there is no difference between the two population means. This is assumed to be true unless there's convincing evidence to the contrary. So, for your null hypothesis, you say there is no difference in the mean time spent on version A and version B. Your alternative hypothesis will state the contrary claim. There is a difference in the mean time spent on version A and version B. Next, set the significance level or the threshold at which you will consider a result statistically significant. This is the probability of rejecting the null hypothesis when it is true. You choose a significance level of 5%, which is the company's standard for A-B testing. Now, find the P value. In this case, your P value is the probability of observing a difference in your sample means as or more extreme than the difference observed when the null hypothesis is true. Based on your sample data, the difference between the mean time spent on version A and version B is five seconds. Your null hypothesis claims that this difference is due to chance. 
Your p-value is the probability of observing an absolute difference in sample means that is five seconds or greater if the null hypothesis is true. If the probability of this outcome is very unlikely, in particular, if your p-value is less than your significance level of 5%, then you will reject the null hypothesis. As a data professional, you'll almost always calculate the p-value on your computer, using a programming language like Python or other statistical software. Later on, you'll use Python for a two sample T test. To find your p-value, first calculate your test statistic. Since you're conducting a T test, you're working with a T score. Use the following formula to calculate the test statistic T based on your sample data: T equals X1 bar minus X2 bar, divided by the square root of (S1 squared over N1, plus S2 squared over N2), where X1 bar and X2 bar are the sample means of your two groups, N1 and N2 are the sample sizes of your two groups, and S1 and S2 are the sample standard deviations of your two groups. If you enter the numbers in the formula and do the calculation, you get a test statistic T of negative 1.2508. For a T test, the test statistic follows a T distribution under the null hypothesis. Recall that your alternative hypothesis states that there is a difference in the mean time spent on version A and version B. The observed difference is five seconds, but because your alternative hypothesis is two-sided, a difference in either direction counts as evidence against the null hypothesis. Since you're interested in values in both directions, either less than or greater than your test statistic, your p-value is the probability of getting a value less than the T score of negative 1.2508 or greater than the T score of positive 1.2508. Your p-value corresponds to the area under the curve on both the left tail and the right tail of the distribution. This is called a two-tailed test. If you calculate the p-value, you'll find that it's 0.2148 or 21.48%. This means that there is a 21.48% probability that the absolute difference between the mean time spent on version A and version B would be five seconds or greater if the null hypothesis is true. To draw a conclusion, compare your p-value with the significance level. If your p-value is less than the significance level, you conclude that there is a statistically significant difference in the means between the two versions. In other words, you reject the null hypothesis that there is no difference in the mean time spent on version A and version B. If your p-value is greater than the significance level, you conclude that there is not a statistically significant difference between the two versions. In other words, you fail to reject the null hypothesis that there is no difference in the mean time spent on version A and version B. Your p-value of 0.2148 or 21.48% is greater than the significance level of 0.05 or 5%. So you fail to reject the null hypothesis and conclude that there is not a statistically significant difference between the mean time spent on version A and version B. In other words, the observed difference in mean time spent is likely due to chance. Your analysis will help the company decide how to redesign their website.
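As a sketch of how this might look in code, SciPy can run the test directly from the summary statistics. Note that setting equal_var to True pools the two sample variances, which reproduces the figures reported in this example; setting it to False would instead run Welch's t-test, which drops the equal-variance assumption.

```python
# A sketch of the landing-page A/B test from the summary statistics alone.
from scipy import stats

result = stats.ttest_ind_from_stats(
    mean1=300, std1=18.5, nobs1=40,   # version A: gray landing page
    mean2=305, std2=16.7, nobs2=38,   # version B: green landing page
    equal_var=True,                   # pool the variances (see note above)
)

print(f"t = {result.statistic:.4f}, p-value = {result.pvalue:.4f}")
# t = -1.2508, p-value = 0.2148
```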
Since there is not a statistically significant difference in mean time spent based on the background colors of gray and green, you can recommend the company either test a different color, such as blue or yellow, or test a different design feature, such as text size or button shape. Perhaps a different design change will have an impact on the mean time spent by customers on the landing page. Recently, you learned that data professionals use a two-sample t-test to compare two population means. For example, we conducted a two-sample t-test to compare the mean time spent on two different versions of a landing page for a cosmetics company. In this video, you'll conduct a two-sample z-test to compare two population proportions. Recall that, for technical reasons, t-tests do not apply to proportions. A data professional might use a two-sample z-test to compare the proportion of defects among manufactured products on two assembly lines, the proportion of side effects from a new medicine in two trial groups, or the proportion of support for a new law among registered voters in two districts. Let's explore an example. Imagine you're a data professional working for an international construction company. The company has offices in London and Beijing. The human resources team would like to determine whether there is a difference in the level of employee satisfaction between the Beijing office and the London office. The team surveys a random sample of 50 employees in each office to discover if they are satisfied with their current job. They ask you to find out if there's a statistically significant difference in the proportion of satisfied employees in London and Beijing. If so, the HR team will devote resources to investigating why employees at one office are more satisfied at work. According to the survey results, 67% of the employees in the London office report being satisfied with their job, and 57% of the employees in the Beijing office report being satisfied with their job. There's an observed difference of 10 percentage points, or 67 minus 57, between the proportion of satisfied employees in London and Beijing. You decide to conduct a two-sample z-test to analyze the data. Let's review the steps for conducting a hypothesis test. First, state the null hypothesis and the alternative hypothesis. Second, choose a significance level. Third, find the p-value. Fourth, reject or fail to reject the null hypothesis. First, state your null hypothesis and alternative hypothesis. In a two-sample z-test, the null hypothesis states that there is no difference between the proportions of your two groups. This is assumed to be true unless there is convincing evidence to the contrary. So for your null hypothesis, you say there is no difference in the proportion of satisfied employees in London and Beijing. Your alternative hypothesis will state the opposite claim: there is a difference in the proportion of satisfied employees in London and Beijing. Next, set the significance level, or the threshold at which you will consider a result statistically significant. This is the probability of rejecting the null hypothesis when it is true. You choose a significance level of 5%, which is the company's standard for employee surveys. Now, find the p-value. In this case, your p-value is the probability of observing a difference in your sample proportions as or more extreme than the difference observed when the null hypothesis is true.
Based on your sample data, the difference between the proportion of satisfied employees in London and Beijing is 10 percentage points. Your null hypothesis claims that this difference is due to chance. Your p-value is the probability of observing an absolute difference in sample proportions of 10 percentage points or greater if the null hypothesis is true. If the probability of this outcome is very unlikely, in particular, if your p-value is less than your significance level of 5%, then you will reject the null hypothesis. As a data professional, you'll almost always calculate the p-value on your computer, using a programming language like Python or other statistical software. To find your p-value, first calculate your test statistic z. Use the following formula to calculate the z statistic based on your sample data: z equals p1 hat minus p2 hat, divided by the square root of p0 hat times (one minus p0 hat) times (one over n1 plus one over n2), where p1 hat and p2 hat are the sample proportions for your first and second group, n1 and n2 are the sample sizes for your first and second group, and p0 hat is the pooled proportion. The pooled proportion is a weighted average of the proportions from your two samples. This is a separate formula that you don't need to worry about right now. If you enter the numbers in the formula and do the calculation, you get a z-score of 1.03. For a z-test, the test statistic follows a normal distribution under the null hypothesis. Recall that your alternative hypothesis states that there is a difference in the proportion of satisfied employees in London and Beijing. The observed difference is 10 percentage points, but because your alternative hypothesis is two-sided, a difference in either direction counts as evidence against the null hypothesis. Since you're interested in values in both directions, either less than or greater than your test statistic, your p-value is the probability of getting a value less than the z-score of negative 1.03 or greater than the z-score of positive 1.03. Your p-value corresponds to the area under the curve on both the left tail and the right tail of the distribution. This is a two-tailed test. If you calculate the p-value, you'll find that it's 0.3030 or 30.3%. This means that there's a 30.3% probability that the absolute difference between the proportion of satisfied employees in London and Beijing would be 10 percentage points or greater if the null hypothesis is true. To draw a conclusion, compare your p-value with the significance level. If your p-value is less than the significance level, you conclude that there is a statistically significant difference in the proportions between the two groups. In other words, you reject the null hypothesis. If your p-value is greater than the significance level, you conclude that there is not a statistically significant difference in the proportions between the two groups. In other words, you fail to reject the null hypothesis. Your p-value of 0.3030 or 30.3% is greater than the significance level of 0.05 or 5%. So you fail to reject the null hypothesis and conclude that there is not a statistically significant difference between the proportion of satisfied employees in the London office and the Beijing office. In other words, the observed difference in proportions is likely due to chance.
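Here's a minimal Python sketch of the same calculation, computing the pooled proportion and the two-tailed p-value by hand with SciPy's normal distribution. The variable names are illustrative.

```python
# A sketch of the two-proportion z-test for the employee satisfaction survey.
from math import sqrt

from scipy import stats

p1, n1 = 0.67, 50   # London: sample proportion satisfied, sample size
p2, n2 = 0.57, 50   # Beijing: sample proportion satisfied, sample size

# The pooled proportion is a weighted average of the two sample proportions
p0 = (p1 * n1 + p2 * n2) / (n1 + n2)

z = (p1 - p2) / sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))

# Two-tailed test: add the area in both tails beyond |z|
p_value = 2 * stats.norm.sf(abs(z))

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
# z = 1.03, p-value = 0.3030
```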
Your analysis will help the human resources team save a lot of time and money. Since there's not a statistically significant difference in employee satisfaction between the two offices, the HR team will not have to devote resources to investigating the reasons behind a difference. However, they may want to find out how to make the employee satisfaction level a bit higher. Recently, we talked about how data professionals use two sample hypothesis tests to determine whether the difference between two sample means is statistically significant or due to chance. Two sample tests are especially useful for AB tests, which often sample two groups of users to compare how they respond to different versions of a product or a website. For instance, an AB test can help you decide if an observed difference in average click rates is due to a specific change in website design or due to chance. In this video, you'll use Python to conduct a two sample T test based on sample data. Python is helpful for conducting hypothesis tests with speed and accuracy. We'll continue with our scenario from an earlier part of the course, in which you're a data professional working for the Department of Education of a large nation. Recall that you're analyzing data on the literacy rate for each district. You'll continue to use the data set you worked with before. If you need to access the data, do so now. For this video, we'll make a new change to our story. Imagine that the Department of Education asks you to collect data on mean district literacy rates for two of the nation's largest states, State 21 and State 28. State 28 has almost 40 districts, and State 21 has more than 70. Due to limited time and resources, you are only able to survey 20 randomly chosen districts in each state. The Department of Education asks you to determine if the difference between the two mean district literacy rates is statistically significant or due to chance. This will help the Department decide how to distribute government funding to improve literacy. If there is a statistically significant difference, the state with the lower literacy rate may receive more funding. You can use Python to simulate taking a random sample of 20 districts in each state and conduct a two-sample t-test based on the sample data. Let's open up a Jupyter notebook and get started. Import the Python package you will use, pandas. To save time, rename the package with the abbreviation pd. To load the SciPy stats module, write from scipy import stats. To start, filter your data frame for the district literacy rate data from the states, State 28 and State 21. First, name a new variable, state21. Then, use the relational operator for equals, ==, to get the relevant data from the state name column. Now, name another variable, state28. Follow the same procedure to get the relevant data from the state name column. Now that you've organized your data, use Python to simulate random sampling. Use the sample function to take a random sample of 20 districts from each state. First, name a variable called sampled_state21. Then, enter the arguments of the sample function: n, the sample size, equals 20; replace equals True, because you're sampling with replacement; and for random_state, choose an arbitrary number for the random seed to start the computer's random number generator. How about 13,490? Recall that using the same random seed lets you generate the same set of random numbers, in case you want to return to this sample data in the future.
Now, name another variable, sampled_state28. Follow the same procedure, but this time choose a different number for the random seed. How about 39,103? You now have two random samples of 20 districts, one sample for each state. Next, use the mean function to compute the mean district literacy rate for both state 21 and state 28. State 21 has a mean district literacy rate of about 70.8%, while state 28 has a mean district literacy rate of about 64.6%. Based on your sample data, the observed difference between the mean district literacy rates of state 21 and state 28 is 6.2 percentage points, or 70.8 minus 64.6. At this point, you might want to conclude that state 21 has a higher overall literacy rate than state 28. However, you don't want to assume the results of your hypothesis test ahead of time. The observed difference in your sample might be due to chance or sampling variability. In other words, the difference might not be due to an actual difference in the corresponding population means. Is the observed difference statistically significant or due to chance? Your hypothesis test will help you determine the answer. Now that you've organized your data and simulated random sampling, you're ready to conduct your hypothesis test. Recall that the two-sample T test is the standard approach for comparing the means of two independent samples. Let's review the steps for conducting a hypothesis test. First, state the null hypothesis and the alternative hypothesis. Second, choose a significance level. Third, find the p-value. Fourth, reject or fail to reject the null hypothesis. First, state your null hypothesis and alternative hypothesis. In a two-sample T test, the null hypothesis states that there's no difference between the means of your two groups. This is assumed to be true unless there's convincing evidence to the contrary. So for your null hypothesis, you say: there's no difference in the mean district literacy rates between state 21 and state 28. Your alternative hypothesis will state the contrary claim: there is a difference in the mean district literacy rates between state 21 and state 28. Next, set the significance level, or the threshold at which you will consider a result statistically significant. This is the probability of rejecting the null hypothesis when it is true. The education department asks you to use their standard level of 5%, or 0.05. Now, compute the p-value. In this case, your p-value is the probability of observing a difference in your sample means as or more extreme than the difference observed when the null hypothesis is true. Based on your sample data, the difference between the mean district literacy rates of state 21 and state 28 is 6.2 percentage points. Your null hypothesis claims that this difference is due to chance. Your p-value is the probability of observing an absolute difference in sample means that is 6.2 percentage points or greater if the null hypothesis is true. If the probability of this outcome is very unlikely, in particular, if your p-value is less than your significance level of 5%, then you will reject the null hypothesis. For a two-sample T test, use the SciPy function stats.ttest_ind() to compute your p-value. This function includes the following arguments: a refers to observations from your first sample; b refers to observations from your second sample; and equal_var is a Boolean, or true/false statement, which indicates whether the population variance of the two samples is assumed to be equal.
In our example, you don't have access to data for the entire population, so you don't want to assume anything about the variance. To avoid making a wrong assumption, set equal_var to False. Now you're ready to write this all out in code. Start with the stats.ttest_ind() function and enter the relevant arguments. a, your first sample, refers to the district literacy rate data for state 21, which is stored in the overall literacy column of your variable sampled_state21. b, your second sample, refers to the district literacy rate data for state 28, which is stored in the overall literacy column of your variable sampled_state28. Set equal_var to False, because you don't want to assume that the two samples have the same variance. Now, run the code. The output confirms your p-value is about 0.0064, or 0.64%. This means there's only a 0.64% probability that the absolute difference between the two mean district literacy rates would be 6.2 percentage points or greater if the null hypothesis is true. In other words, it's highly unlikely that the difference in the two means is due to chance. To draw a conclusion, compare your p-value with the significance level. Remember, if your p-value is less than the significance level, you conclude that there is a statistically significant difference in the mean district literacy rates between state 21 and state 28. In other words, you reject the null hypothesis. If your p-value is greater than the significance level, you conclude that there is not a statistically significant difference in the mean district literacy rates between state 21 and state 28. In other words, you fail to reject the null hypothesis. Your p-value of 0.0064, or 0.64%, is less than the significance level of 0.05, or 5%. So you reject the null hypothesis and conclude that there is a statistically significant difference between the mean district literacy rates of the two states, state 21 and state 28. Your analysis will help the department decide how to distribute government resources. Since there is a statistically significant difference in mean district literacy rates, the state with the lower literacy rate, state 28, will likely receive more resources to improve literacy. The two sample t-test is a powerful tool for investigating the differences between two sample means. And data professionals often use t-tests to help stakeholders make data-driven decisions. That's the end of your introduction to hypothesis testing. In your future career as a data professional, you'll conduct hypothesis tests to help determine the statistical significance of your results. To start, we went over the general procedure for hypothesis testing, from stating your null hypothesis and alternative hypothesis, to choosing your significance level, finding your p-value, and rejecting or failing to reject your null hypothesis. Then we explored two different types of hypothesis tests: one sample and two sample. To understand how one sample hypothesis tests work, we conducted a one sample z-test to analyze data on the mean delivery time of an online food delivery service. Then, to explore two sample tests, we conducted a two sample z-test to compare the proportion of satisfied employees at two different offices of a construction company. Finally, you learned how to use Python's SciPy stats module to conduct a two sample t-test to determine if the difference between two population means is statistically significant.
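Putting the whole literacy-rate walkthrough together, here's a minimal sketch of the notebook code. The file name, column names, and state labels shown here are assumptions for illustration; substitute whatever names your copy of the data set uses.

```python
# A sketch of the full literacy-rate workflow. The file name, column
# names (STATNAME, OVERALL_LI), and state labels are assumptions;
# substitute the names used in your data set.
import pandas as pd
from scipy import stats

education_districtwise = pd.read_csv("education_districtwise.csv")

# Filter the data frame for the districts in each state
state21 = education_districtwise[education_districtwise["STATNAME"] == "STATE21"]
state28 = education_districtwise[education_districtwise["STATNAME"] == "STATE28"]

# Simulate random sampling: 20 districts per state, sampling with
# replacement, using the seeds from the video for reproducibility
sampled_state21 = state21.sample(n=20, replace=True, random_state=13490)
sampled_state28 = state28.sample(n=20, replace=True, random_state=39103)

print(sampled_state21["OVERALL_LI"].mean())  # about 70.8
print(sampled_state28["OVERALL_LI"].mean())  # about 64.6

# Two-sample t-test; equal_var=False avoids assuming the two
# populations have equal variance
result = stats.ttest_ind(
    a=sampled_state21["OVERALL_LI"],
    b=sampled_state28["OVERALL_LI"],
    equal_var=False,
)
print(result.pvalue)  # about 0.0064
```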
Data professionals often use two sample tests in the context of A-B testing. Companies use A-B testing to evaluate everything from website design and mobile apps to digital ads and marketing emails. In this part of the course, we conducted a two sample t-test to compare the mean time spent on two different versions of a landing page for a cosmetics company. A-B tests help business leaders optimize performance and improve customer experience. Soon, you'll put your Python skills to work in the portfolio project. The project will feature a realistic scenario based on an A-B test. So you're well-prepared. Coming up, you have a graded assessment. To prepare, check out the reading that lists all the new terms you've learned, and feel free to revisit videos, readings, and other resources that cover key concepts. Congratulations on all your progress. Well done. Hi there, it's Tiffany, back again to talk to you about your portfolio projects and how you can use them in your job search. Just like in your previous courses, you'll complete an independent project for your portfolio. Completing this project is a great way to present your knowledge of and experience with data analytics tasks to potential employers. This time, your project will demonstrate what you've learned about statistics. This portfolio project is also an opportunity to develop your interview skills. As potential employers assess you as a candidate, they might ask for specific examples of how you tackled challenges in the past. You can use your portfolio as a way to discuss data problems you have solved. This project will help you consider how stats informs data professionals' work, and it will provide a concrete example for you to discuss in future job interviews. Some employers might also ask you to complete a specific task, like an AB test. In addition to having a portfolio, creating your own experimental designs means you'll be that much more prepared for those interviews. Being able to use code and calculate formulas with statistical software is important for your future career, and understanding how to apply those statistical models to a workplace problem is essential for success in the data career space. To complete the portfolio project, you'll be presented with details about a business case. Then you will use the instructions to complete a new entry in your PACE strategy document and simulate an AB test to compare two different versions of a product and choose the one that performs better. You will use stats to explore a data set, understand the distribution of your data, and determine if your results are statistically significant. Then you will summarize your findings in a quantitative way and draw a conclusion that ties back to the business problem. By the time you complete this project, you'll have finished an AB test you can add to your portfolio. In your PACE strategy document, you'll also have a record of the steps you took along the way, which you can use to explain your work to future hiring managers. Ready? Then let's get going. My name is Sean. I'm currently a product analyst at YouTube Shopping. Product analysts leverage the data that Google collects to better understand how our users are using our products and whether we can provide better services for them by making better product decisions using data. What really got me into the advanced data analytics field was actually the cliche term big data, back in high school.
I thought that we could use big data to understand how this world functions and even predict the future. And that was really fascinating for me when I was still a teenager. When you're looking for a data science job, in my opinion, the most important thing is to showcase that you actually understand the theories as well as the practical, technical skills of data science. So when you're preparing your proof of evidence, make sure that you have a GitHub link or a collection of past presentations that can show you presented anything related to data science or data visualizations. When you're showcasing your statistical skills in your portfolio or resume, it's really important to note that these skills are easily accessible to people online. Everybody can learn them from YouTube, from Coursera, or from other online course platforms. What you really need to do is showcase how you actually use these skills in practice. For instance, try to explain exactly what business problem or research problem you were trying to solve, and how you decided to use particular models or solutions for that problem. Just knowing the theories or how to implement a model is not enough. Instead, what you really need to showcase is that you actually understand why you're using these models in these situations and how you're applying them. Make sure that you always highlight the business or research objective for the problem you're trying to solve, and then explain exactly how you applied those skills and how you evaluated the results. If you don't have a college degree in data science, don't worry, because the skills you're learning online are actually what's required in the industry. In this course, you've been learning about fundamental concepts of statistics, including descriptive and inferential statistics, basic probability and probability distributions, sampling, confidence intervals, and hypothesis testing. Now it's time to put everything you've learned to work as you complete this portfolio project. In the previous course, you practiced telling stories with data. These skills will carry you forward as you complete this new project. Now that you have some practice completing portfolio projects, start thinking about how you can use stats to make an argument about the validity of a product. In this part of the course, you'll simulate an AB test for a specific company, then use statistical methods to analyze data and interpret findings about which version of the product performed better. Finally, you'll make a recommendation to the business on whether or not to implement a new version of the product, based on the results. Coming up, you'll explore more of what it means to be a data professional. In other sections of this program, you'll work to develop additional skills to help you excel. There's so much more to learn about using mathematical models to analyze data sets. As a data professional, a large part of your job is analyzing data to make an informed recommendation about what direction a business should take on a project. By using stats to conduct and analyze AB tests, you'll help your future employers or clients make informed decisions about the investments they make in their company's products or services. As you progress through the program, you'll learn even more advanced techniques, like regression analysis and machine learning, to demonstrate the power of data analytics to improve business performance.
This part of the portfolio project is a great opportunity to demonstrate to potential employers why you would be a valuable addition to their team.