This video will cover introductory information about business analytics, including a definition of business analytics, the wisdom hierarchy, data sources, and business analytics terms, applications, history, and uses. Business analytics is the process of transforming data into insights that support, improve, and/or automate business decisions. The data can be of many types and from a variety of sources, and there are many techniques and software packages that can be used for an analysis. A simple way to think of business analytics is that it's the tools and processes used to find value in data, to transform the raw data into information that can be acted upon. A common method of understanding the relationship between data, information, knowledge, and wisdom is by using a pyramid. This graphic is called the DIKW pyramid or the wisdom hierarchy, and it illustrates that data is the foundation upon which decisions can be made. Information is defined in terms of data, knowledge is defined in terms of information, and wisdom is defined in terms of knowledge. Data are numbers or text without any context. Information provides meaning from data, often combining multiple data points to produce a tangible idea. Knowledge provides context from the information, making it directly applicable to a situation. Wisdom applies the knowledge to make a decision. The original data has become useful, enabling an action to be taken. So where does data come from? Data can come from a variety of sources, both internal and external. Internal data is collected by businesses and stored within their own servers. This data can be generated in a number of ways, either by physical objects such as sensors or barcodes, or by computer software such as websites. While collecting and storing proprietary data sources is still very common, there are now many external data sources that businesses can use. Public domain data sources such as government surveys or social media posts can be accessed, and there are also many services that offer paid data sources, like stock market or weather data. These external data sources can be combined with a company's proprietary data to build a more complete picture of reality. The amount of data being generated and collected is growing exponentially. This growth is occurring due to a similar exponential trend in computing power, along with a decrease in the cost of digital storage. Because so much new data is being created and captured every year, there is a corresponding growth in demand for business analysts. There are many terms that are synonymous or semi-synonymous with business analytics. In the past few years, data science has become the most common term used to describe this field. Other terms are often used interchangeably, such as business intelligence, big data, data mining, knowledge discovery, and machine learning. While there are many discussions about which term to use in which scenario, all of the terms refer to the overall concept of using data to make better decisions or better products. Business analytics is an interdisciplinary subject based heavily on math and statistics, and it uses computer science principles and algorithms. These math and computer science concepts are applied to a specific subject area, and business analytics has applications in every field. Data on all our clicks on the internet are used by retailers to figure out our preferences and provide targeted advertising.
Analyzing data on past marketing campaigns provides a useful base for determining whom to target in the next round. Supply chains in almost all industries have become far more efficient due to analytics. Organizations use historical data to figure out optimal numbers for staffing and hiring. Companies use demand data to determine optimal prices. And professional sports teams use analytics, such as when they look at player performance, to determine who would be the best draft pick. No matter which industry you work in, there's room for analytics to add value. Although the popularity of analytics has surged recently, the field is nowhere near as new as certain companies or people make it out to be. Since the computer was invented, it has been used to process data to solve problems, from decoding messages in World War II, to generating weather forecasts in 1950, to modeling credit risk in 1958. Of course, these tasks involved enormous computational costs, and only organizations with the most resources could attempt them. Toward the end of the 20th century, as computing power became more affordable, more organizations began collecting and storing data, and the types of analytical projects transitioned from being historical in nature to real-time. In 1992, the first real-time credit card fraud system was introduced. Then the first analytically centric companies emerged. Companies such as Google used data to build their core product, while other companies such as Amazon used analytical techniques to earn market share from competitors. The rapid ascension of these tech companies has led to an arms race in which all businesses have become committed to analytics. Businesses use analytics to gain an edge on their competitors and increase profits. The three main areas in which they do this are competition, to increase revenue from products or services sold; efficiency, to reduce the cost of resources or internal processes; and customer satisfaction, to improve the customer experience and encourage customer loyalty. Here's a case study as an example. Loyalty cards are used by grocery stores to uniquely identify their customers. By requiring a loyalty card to obtain special discounts in the store, the grocer can isolate the habits of each customer and then provide customers with customized promotions to increase spending. When a customer stops frequenting the store, the grocer can mail coupons with aggressive offers. The layout of a grocery store is constantly being changed to maximize customer spending. This is why the milk section is always on the opposite side of the store from the produce section: customers have to traverse every aisle to get to the two most commonly bought items. Each shelf is also analyzed to find the ideal arrangement. More expensive items are typically placed at or around eye level, while cheaper products are on the top or bottom shelves. Optimizing prices is another analytical technique used to maximize customer spending. Many grocery stores have what are called loss leaders, products that are priced very cheaply to draw customers into the store, where they will inevitably spend more on other, higher-priced items. Grocers will also find the ideal times and prices to mark down expiring products, preventing the product from being thrown away at a complete loss. This concludes our introductory video about business analytics. Today we covered a definition of business analytics, the wisdom hierarchy, data sources, and business analytics terms, applications, history, and uses.
This video will cover introductory information about business analytics, including the role of the business analyst, what makes a business analyst successful, and an overview of business analytics tools, project outcomes, and the analytical process. There are three main reasons business analytics is an enticing career choice. The first is that there is a high demand for business analysts, and the relatively low supply of skilled workers means that salaries are higher in this field. Another reason is the challenge of solving interesting problems. Analysts are typically people who are interested in solving complex puzzles. Finally, those who are curious about how things work can use their skills in analyzing data to uncover previously unknown truths. A business analyst can take many roles depending on the data and the type of project. The most common roles are that of an interpreter, in which the analyst uses descriptive analytics to tell the story of what happened; an oracle, in which the analyst uses predictive analytics to predict future events; and a counselor, in which the analyst uses prescriptive analytics to provide advice on the best course of action. An analyst becomes successful due to a combination of hard and soft skills. The hard skills are more tangible and refer to what the analyst can do with which tools, while the soft skills are less flashy on a resume but equally or even more important than the hard skills. Analytical tools can be separated into two categories: software that requires coding, and software with a graphical user interface, or GUI, that is based on point-and-click interaction. The main benefit of writing code is that it allows for more flexibility; there are more features, and it allows for more possibilities. The drawback of coding is the steeper learning curve. Within the last decade, there have been many new GUI programs that make analytics easier to implement without the need for writing code. Programs such as Tableau, Alteryx, and RapidMiner have started gaining market share, joining older tools such as SAS Enterprise Guide, but none of these tools has yet replaced the overwhelmingly popular code-based software such as SAS, R, Python, or SQL. There are many different goals that an analytics project can strive for. Typically, these goals fit into one of two categories. The first is providing information about a business, such as reports and dashboards for business stakeholders. Reports or presentations provide one-time insights to explain events that have occurred and predict future events, and dashboards are used by stakeholders for ongoing monitoring of key aspects of the business. The second category is the production of analytical products. In these types of projects, the business's data becomes the input for a complex process that automatically produces an action. This can take the form of features that offer a better experience for consumers. For instance, Amazon has an automated algorithm that determines products you might like to buy. Analytical products can also be built to make internal business processes more efficient. An example of this is how credit card companies test every transaction for the probability of fraud. An analytical project should start with its goals well defined. Very rarely are projects started simply to explore the data or find hidden truths. Collecting an inventory of all the relevant data sources is also an important step at the beginning of a project.
Finally, every analysis should begin with an effort to better understand the data. This should be done before any analytical techniques are used. Simply observe the data files and write down a list of observations, questions, and any other ideas you have. This process, which deliberately creates disfluency, supplements the data dictionary and helps the analyst internalize elements of the data. Cognitive disfluency is the principle that enables students who take lecture notes by writing to retain more of the material than students who type notes, even though those who type can take notes more efficiently. Sometimes the more work we have to do to process information, the better we understand it. Using this principle at the beginning of a project, before using any advanced analytical techniques, enhances the analyst's capability to understand the data. After the initial stages of a project, there are a few more steps in the analytical process. The majority of time is spent exploring and preparing data. In the exploration stage, the analyst learns more about the variables, including distributions and frequencies of values, and also identifies variables with missing or null values. In the preparation stage, the data becomes more useful as variables are transformed and anomalous values are rectified. This is referred to as data cleaning. Another common task during this stage is to join data into as few sources as possible. Perhaps one of the most valuable techniques to use during the preparation stage is called feature engineering, in which the analyst transforms certain variables into new variables containing slightly different data. This allows hidden aspects of variables to be analyzed. For example, a data set with a date variable can have a new variable added to indicate whether that date falls on a weekday or the weekend, and this new variable could add important information that was not readily available in the original data source. This is one of the areas of the analytical process where creativity is needed.
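To make that weekday-or-weekend example concrete, here is a minimal sketch of the feature in pandas; the data set and the column name "date" are hypothetical stand-ins, not taken from the video.

```python
import pandas as pd

# Hypothetical data set with a date variable.
df = pd.DataFrame({"date": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-04"])})

# dayofweek counts Monday as 0 through Sunday as 6, so 5 and 6 are weekend days.
df["is_weekend"] = df["date"].dt.dayofweek >= 5
```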
The last two steps in the analytical process are to build models and then to put these models into production. A model is a type of mathematical equation that describes relationships among variables in a data set, often for the purpose of predicting an outcome. By putting a model in production, an automated decision can be made when new data is observed. In some cases, models are not the goal of a project. Rather, the goal is to analyze data and communicate the findings in an analytical report, presentation, or dashboard. Visualization of the data is also a key component throughout the entire analytical process, as the analyst attempts to learn more about the data. This concludes our introductory video about business analytics. Today we covered the role of the business analyst, what makes a business analyst successful, and an overview of tools, project outcomes, and the analytical process.

This video will cover some of the foundations of business analytics: selecting, filtering, and sorting. There are many synonymous terms to describe the aspects of a typical data file, which can be referred to as a table, spreadsheet, data set, or data source. Along the horizontal axis are the variables, which can also be called fields, attributes, or columns. Along the vertical axis are the observations, also referred to as records, tuples, or rows. Fields that are sparsely populated, meaning that a high percentage of records are missing a value, should not be selected. Fields with redundant values, meaning that a high percentage of records have the same value, should also not be selected. Techniques to identify these fields vary; for now, a visual inspection of the first few records can be used. From this table, we can remove the state column because all of its values are the same. We can remove the religion column because it is sparsely populated. Our table now depicts only the variables of interest. Once the variables of interest have been selected, they can be renamed if needed and assigned the proper data type. There are two common variable types, numeric and categorical. Numeric variables include discrete variables, which can include whole numbers used for counting and IDs that represent a unique entity, and continuous variables, which are numbers with decimals used for measuring. Categorical variables include text, which is any combination of letters, numbers, and symbols in strings or characters, and Boolean or binary variables, which contain only one of two possible values. Variables that can be treated as either numeric or categorical include dates, such as date, time, or date-time values, and spatial objects, which show location, such as latitude and longitude. Here are the variable types for our table. Person ID is a discrete numeric variable. Gender and city are string variables. Weight is a fixed decimal variable. Date of birth is a date, and student is Boolean. Software packages can automatically assign fields their data types, and sometimes these are assigned incorrectly. Examining these data types is necessary to ensure they are used properly downstream. Sometimes data formats require advanced data transformation and extraction techniques. For example, dates can arrive in varying formats that are interpreted as strings, and parts of these strings need to be separated before the software can identify the field as a date. For now, we'll assume that you can change a field's data type without these advanced techniques. The final aspect of the select step is to assign the proper size for each variable. Many software packages can do this automatically, but it's still a good idea to review the sizes. Because the size of each field has a direct impact on the amount of storage required for a data set, specifying these can save system resources and processing time. A common issue is one record that contains many more characters than the typical record, causing the size to be much larger than it needs to be. A visual inspection of the table can highlight these instances, but beware that changing a variable's size can truncate, or cut off, some of the values. While selecting limits a data set's size by omitting certain columns, a filter limits a data set's size by omitting certain rows. A filter is also known as a condition, a subset, or, in SQL, a WHERE clause, and it's commonly used for investigative purposes. It's important to be aware of the information contained within the rows of a data table. Also commonly referred to as a record or observation, a row defines the level of detail that is contained in the data set. For this example, each row represents one person. An ID field, such as person ID, in which no rows contain the same ID, can be used to determine that the table's level of detail is a person. If there are multiple rows for the same ID, we know that the rows reflect data from the same person; in this example, the level of detail is more granular, showing a record of one person's weight by day. Finding and understanding the level of detail for a table is necessary before analyzing the data. After determining the table's level of detail, you should check for duplicate observations. These are rows of data in which all values are exactly the same as another row. In some cases, duplicate observations may be legitimate, but usually these are erroneous and need to be removed. Here's our table with the erroneous observation removed. This table is filtered to show only records of people who live in Raleigh. Let's try another filter. Here's our original data set again. This table is filtered to show only records of people who weigh more than 180 pounds. Filters on different variables can be applied together. Combining the above examples, we can filter the original table for people who live in Raleigh and weigh more than 180, which results in only one observation. The next step is sorting. When we sort, we rearrange a table by ordering the rows according to the values of one or more fields, in either ascending or descending order. Here's our original data set. It's sorted by date of birth in ascending order. Here we've sorted by city in ascending order and then by weight in descending order.
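Here is a brief sketch of these filtering and sorting steps, assuming pandas; the table is a simplified stand-in for the example data, with hypothetical values.

```python
import pandas as pd

# Simplified stand-in for the example table: one row per person.
people = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "city": ["Raleigh", "Durham", "Raleigh", "Cary"],
    "weight": [195, 170, 150, 182],
})

# Filter: people who live in Raleigh and weigh more than 180 pounds.
raleigh_heavy = people[(people["city"] == "Raleigh") & (people["weight"] > 180)]

# Sort: city ascending, then weight descending.
ordered = people.sort_values(["city", "weight"], ascending=[True, False])
```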
This concludes our video on selecting, filtering, and sorting.

This video will cover two types of formulas: same-row and multi-row. A commonly used method of data preparation is using existing fields to create new variables. Examples include adding, subtracting, multiplying, dividing, or applying another mathematical function to a numeric field; extracting and transforming substrings from a text field; extracting, truncating, and parsing date parts from a date field; conditional statements using if-then logic and binning to create new variables; and comparing values of two different fields to create a Boolean variable. Formulas can be used for adding, subtracting, multiplying, or dividing two numeric fields. In this table, we apply a formula to create the price column, which was calculated by dividing total amount by units sold. Formulas can also be used when applying a mathematical function to a numeric field. Examples of mathematical functions are average; floor, which rounds a value down; ceiling, which rounds a value up; square; square root; absolute value; trigonometric functions; and logarithmic functions. In this table, we create an observation average column using the average formula, which averages values from the observation 1 and observation 2 columns. Formulas can also be used to extract and transform strings from a text field. There are a few common functions for transforming text, and they are typically named differently depending on the software package being used. The most common of these are upper or lower, used to change the field to all uppercase or lowercase characters; concatenate, which combines two or more strings into one field; substring, used to extract a portion of a string; trim, which can be used to remove certain characters, usually spaces, from a field; index, used to find the location of a certain string within a field; and length, which finds how many characters are in a field. These are the basic string functions that almost every software package will include, and they can be combined to complete almost any string transformation. For the following example, we will use two functions to transform a field that holds a city and state into two fields. To extract city and state from the airport location field, we would need to extract all the text before the comma as the city field, and everything after the comma and space as the state field. To do this, we need two functions, index and substring. The index function will help us identify the location of the comma in the text field, counting from the left and including spaces. This is done because the comma is not always in the same location. Then we use the substring function to extract everything before the comma as the formula for the city field, and everything after the comma and space for the state field. To get the city and state columns back into the format of the original field, we would use a concatenate function.
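Here is a small sketch of that index-and-substring pattern using plain Python string methods; the value is an illustrative stand-in for the airport location field.

```python
# Illustrative value for the airport location field.
location = "Raleigh, NC"

comma = location.index(",")     # position of the comma, counting from the left
city = location[:comma]         # everything before the comma
state = location[comma + 2:]    # everything after the comma and the space

# Concatenate to rebuild the original format.
rebuilt = city + ", " + state
```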
We can also use formulas for extracting, truncating, and parsing date parts from a date field. A date or date-time field carries more information than a typical variable: the time of day, the day of the week, the day of the month, the month of the year, the year number, and the difference between the date and other dates, such as today. Extracting this and other information out of a date field is a transformation technique that can add to the value of a data set during analysis and reporting. The most heavily used date functions are today, used to get today's date; date part, used to get a part of a date (for example, getting the month of a date will return the numeric value of the month); truncate, which will return a date value at the beginning of a specified period (for example, truncating a date at the month level will return the first day of the month for that date); date add, which will add or subtract a certain number of periods to a date; and date difference, used to calculate the number of periods between two dates. This table contains the major holidays for 2016. For each date value, we can extract other features. We can add a column with a numeric value to describe the day of the week. We can subtract today's date from the date column to find the number of days remaining until the holiday occurs.
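As a sketch of these date functions in pandas (the holiday dates are illustrative, and the fixed "today" value stands in for a today() function):

```python
import pandas as pd

# Illustrative holiday dates for 2016.
holidays = pd.DataFrame({"holiday": pd.to_datetime(["2016-07-04", "2016-11-24", "2016-12-25"])})

holidays["day_of_week"] = holidays["holiday"].dt.dayofweek                      # date part: Monday = 0
holidays["month_start"] = holidays["holiday"].dt.to_period("M").dt.start_time  # truncate to month
today = pd.Timestamp("2016-01-01")                                              # stand-in for today()
holidays["days_remaining"] = (holidays["holiday"] - today).dt.days              # date difference
```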
With formulas, we can also compare the values of two different fields to create a Boolean variable. A Boolean variable is typically used to capture true-false values. To create a Boolean variable, a logical statement or condition can be used in a formula. This logical statement can compare two values of the same type: numeric values, strings, or dates. This table depicts the population of states. We can use a formula to create a column that displays whether the state is on the east coast. We can also create a column to display whether the state has a high population. Formulas can also be used with conditional statements, such as if-then logic, and when binning to create new variables. To create a variable with multiple possible outcomes, conditional statements are needed. These usually take the form: if condition one is met, then outcome one; else, if condition two is met, then outcome two; else, outcome three. While the conditions within these statements are the same as in Boolean formulas, these formulas offer more flexibility, as the analyst can assign any value when a condition is met and can use as many conditions and outcomes as needed. While there are many applications of conditional statements, one of the most common is to bin or tile numeric variables, a process which transforms continuous numeric variables into categorical variables. In this table, we compute a sales volume category with if-then logic: if sales is greater than 40,000, then high; else, if sales is greater than 20,000, then medium; else, low. This formula begins by testing the first condition, whether sales is greater than 40,000, and if it is true, the first outcome, high, will be assigned. The second condition, else if greater than 20,000, is evaluated only if the first condition is false. If the second condition is true, the second outcome, medium, will be assigned. If neither the first nor second condition is true, then the final outcome, low, will be assigned.
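Here is a minimal sketch of that binning rule in Python; the sales figures are illustrative.

```python
import pandas as pd

# Illustrative sales figures.
sales = pd.DataFrame({"sales": [55000, 31000, 12000]})

def sales_volume(s):
    # Conditions are tested in order, mirroring the if / else-if / else logic.
    if s > 40000:
        return "high"
    elif s > 20000:
        return "medium"
    else:
        return "low"

sales["sales_volume"] = sales["sales"].apply(sales_volume)
```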
Sometimes data from a separate row or rows needs to be used when creating variables. For this, a multi-row formula is needed. A running total is the most basic multi-row formula. It adds the value of a field in the current row to the accumulated value from the previous rows. This can be done across an entire data set for a particular variable, or it can be grouped by certain dimensions. A lag value looks at data in preceding rows, while a lead value, the opposite of lag, looks at data in subsequent rows. Window functions provide the ability to perform calculations like sum, average, and rank across sets of rows that are related to the current row. This is equivalent to aggregating the data set across one or more dimensions and then joining the resulting data set back to the original. This should be used carefully, as values will be repeated for all levels of the dimension. An example of this is in the transportation industry, where a row typically represents one leg of a trip. After sorting the data set properly, the lag function can be used to combine rows so that the data describing the round trip is displayed. Total price is a running total of ticket price, grouped by passenger ID. Notice how, upon reaching the first row for a passenger ID, the sum starts over, and total price equals ticket price. Total flight time is a running total of flight time, grouped by passenger ID. Lag of destination uses the lag function, which doesn't consider any variables other than destination; this means it's dependent on the sort order of the data set. Sum of ticket price is a window function, which is the sum of the ticket price partitioned, or grouped, by passenger ID.
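These multi-row formulas map naturally onto pandas operations; here is a sketch with an illustrative flight-leg table.

```python
import pandas as pd

# Illustrative flight legs: one row per leg, sorted by passenger and leg order.
legs = pd.DataFrame({
    "passenger_id": [1, 1, 2, 2],
    "destination": ["ORD", "RDU", "LAX", "RDU"],
    "ticket_price": [150.0, 140.0, 300.0, 310.0],
})

# Running total of ticket price that restarts for each passenger.
legs["total_price"] = legs.groupby("passenger_id")["ticket_price"].cumsum()

# Lag of destination: depends only on the current sort order of the data set.
legs["lag_destination"] = legs["destination"].shift(1)

# Window function: sum of ticket price partitioned by passenger, repeated on every row.
legs["sum_ticket_price"] = legs.groupby("passenger_id")["ticket_price"].transform("sum")
```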
This concludes our video on single and multi-row formulas.

This video will cover unions and joins. Union is also referred to as appending, concatenating, or combining two tables vertically, or simply adding rows. The tables should have the same variable names, types, and sizes. Variables that are in only one table will receive null or missing values for the rows from the tables that do not contain them. In this union, our two tables, one with three rows and one with one row, combine to create one larger table, which has four rows. Notice the null values in the storage column. Storage was not contained in both original tables, so the rows that do not contain data for storage will have null or missing values. Joining is also referred to as merging. Very rarely is data collected and stored in a format that's ideal for analysis. Typically, data is stored across multiple tables in a relational database schema like you see here. Schemas are designed to optimize for storage, so values that repeat often and take up more storage, such as names, are reduced to representative IDs that take up much less storage. In a schema, the tables that contain identifying information are called dimension tables and can sometimes be referred to as lookup tables. Our diagram has three; for example, the product dimension table contains the name, manufacturer, and type of product. Each dimension table has a primary key that is used to identify a unique record. The fact table is a record of events that happen for a combination of dimensions. It's comprised of two things: foreign keys and measures. Foreign keys map to primary keys of dimension tables. These are the fields that will be used to join the tables together. For example, in this diagram, the foreign keys are person ID, store ID, and product ID, which each connect to their own dimension table. Measures can be any type of variable we've already discussed. In this diagram, we have two measures, units sold and total amount. To join these tables together, we will use an inner join. This will return only rows that are contained in both tables. When a fact table has null or missing values for a dimension, using an inner join will remove the entire row. Since the row may contain useful data in other fields, it's better to use a left outer join to keep all values from the left table, in this case the fact table, along with the values from the dimension table that match on the key. While there are other types of joins, as shown here, inner and left outer joins are the most commonly used. First, we will join the transaction fact table with the person dimension. Next, we will join the result of the previous join with the store dimension. Finally, we will join the result of the previous join with the product dimension. Notice how each step creates a progressively wider table.
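Here is a sketch of that chain of left outer joins in pandas; the tables are simplified stand-ins for the fact and dimension tables in the diagram.

```python
import pandas as pd

# Simplified stand-ins for the fact and dimension tables.
fact = pd.DataFrame({"person_id": [1, 2], "store_id": [101, 102],
                     "product_id": [1001, 1002],
                     "units_sold": [1, 2], "total_amount": [425.0, 1300.0]})
person = pd.DataFrame({"person_id": [1, 2], "person_name": ["Ana", "Ben"]})
store = pd.DataFrame({"store_id": [101, 102], "store_name": ["Best Buy", "HHGreg"]})
product = pd.DataFrame({"product_id": [1001, 1002], "product_name": ["G2", "S6"]})

# Left outer joins keep every fact row, even if a key has no match in a dimension.
joined = (fact.merge(person, on="person_id", how="left")
              .merge(store, on="store_id", how="left")
              .merge(product, on="product_id", how="left"))
```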
This concludes our video on unions and joins.

This video will cover a definition of aggregation, as well as examples of aggregation. Aggregation is also referred to as a pivot table, a group-by statement, or summarizing. Aggregation transforms data into lower dimensions using summing, averaging, and counting. The benefit of this is the ability to answer basic questions about data sets with many different dimensions. The most basic aggregation can be done on the level of detail for the table. In this case, each row is a transaction, and the table as a whole represents sales of phones for quarter one of 2015. The answers can be calculated by applying a formula to a single field using all the rows in the table. How many units were sold? 9. How much total revenue? $3,590. How many transactions? 6. How many distinct products were sold? 3. How many customers? 3. How many stores had sales? 4. What was the average price? $398.89. What was the average number of units sold per transaction? 1.5. What was the average amount spent per transaction? $598.33. What was the average amount spent per customer? $1,196.67. Pay special attention to how the average function works. It uses the sum of the field divided by the number of rows, so it can't be used in averages such as this, where the denominator is not the number of rows. The next questions are related to aggregating one dimension at a time and can be calculated in one step. For each of the questions, the original table is grouped by the dimension of interest, either the person ID, store ID, product ID, or date. For each value of the chosen dimension, calculations are performed using the measures, the number of rows, and the number of distinct IDs. A new data table is the result of these calculations. Notice that when grouping by a dimension, the dimension is sorted in the output. For this example, we'll aggregate at the person level of detail to answer the following questions. How many units did each person buy? How much money did each person spend? For each person ID, we need to calculate the sum of units sold and the sum of the total amount. Starting with person ID 1, we see that there were 1 plus 1 equals 2 units sold and 365 plus 425 equals $790 spent. These values are populated in the resulting table. Similar calculations are done for every person ID in the original table. Aggregating at the store level of detail can provide basic information about store performance. To find the number of transactions each store had, each occurrence of a store ID needs to be counted. Store ID 101 has two rows in the original table, indicating that there were two total transactions. Each store ID has its number of rows counted, and the final result is the numTransactions column in the newly created table. Aggregating at the product level of detail can answer basic questions about each product, such as: What was the average price of each product? How many distinct customers bought each product? Starting with product ID 1001, average price is calculated as the sum of the total amount, 1,625, divided by the sum of units sold, 4. The final value of $406.25 is populated into the resulting table. Again, be aware that using an average function in this case would not give us the expected result. Distinct customers is calculated by counting the distinct number of person IDs for product ID 1001; there are two, person ID 1 and person ID 2. The calculations are then completed for every product ID.
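Here is a sketch of these one-dimension aggregations in pandas. The transaction table below is an illustrative stand-in, not the video's exact data.

```python
import pandas as pd

# Illustrative transaction table: one row per transaction.
tx = pd.DataFrame({
    "person_id": [1, 1, 2, 4, 4, 4],
    "store_id": [101, 103, 101, 102, 104, 102],
    "product_id": [1001, 1002, 1001, 1003, 1002, 1003],
    "units_sold": [1, 1, 3, 1, 2, 1],
    "total_amount": [425.0, 365.0, 1200.0, 600.0, 400.0, 600.0],
})

# Person level: units bought and money spent by each person.
by_person = tx.groupby("person_id").agg(
    units=("units_sold", "sum"), spent=("total_amount", "sum"))

# Store level: number of transactions per store.
by_store = tx.groupby("store_id").size().rename("num_transactions")

# Product level: average price as sum of amount over sum of units,
# plus the count of distinct customers.
by_product = tx.groupby("product_id").agg(
    amount=("total_amount", "sum"), units=("units_sold", "sum"),
    customers=("person_id", "nunique"))
by_product["avg_price"] = by_product["amount"] / by_product["units"]
```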
The final example uses the month and year components of the date field to calculate all possible levels of aggregation, which were shown on the previous slides. How many units were sold each month? How much revenue each month? How many transactions each month? What was the average price? How many distinct products were sold each month? The final type of question that can be asked involves combinations of multiple dimensions. One of the most common combinations is to aggregate by a date dimension along with another dimension. For this calculation, the computations are performed for each combination of the dimensions chosen for grouping. In the transaction table, there are four such combinations. Transactions for person ID 1 occur only in March of 2015. There's only one transaction for person ID 2, occurring in February 2015. And there are three transactions for person ID 4, two occurring in January and one occurring in February of 2015. Within each of these combinations, each of the calculations is performed. Here we'll show you the first row as an example. Note that the more unique dimensions a data table contains, the more the number of possible levels of aggregation increases. In this example table, with four unique dimensions, there are 14 different possible levels of aggregation. Each different level of aggregation can offer a unique insight, but it's best to start with two or fewer dimensions. Example questions that can be answered using the example transaction data set include: which customers tend to buy which products, which stores do customers tend to frequent, what products sell well in what stores, and during what months do they sell more on average. Here's one final example using multiple concepts that have been covered. Using the transaction fact table, we want to answer the question: which stores had products that were on sale? Answering this question is a multi-step process. The first step is to find the average price for each product. For this, we aggregate by product ID, and for each product ID, we calculate the average price. For example, product 1001 sold for $425 and $400. The average of these two values is calculated and then entered into the resulting table. Values for subsequent IDs are calculated in a similar manner. We also need to aggregate at the store and product level. This removes the customer- and date-level information, giving us the totals for each store and product combination. Due to the limitations of this example, the aggregation does not result in fewer rows, as it normally would. The resulting table is, however, sorted by store ID and product ID. The next step is to join the product aggregate table to the store-product aggregate table, using product ID as the common field between the two tables. This results in a table with a similar structure to the store-product aggregate table, with the only difference being the new column, average product price. We filter the resulting table to include only rows where price is less than average product price. There are only two rows where this is true. Finally, we join this table with the store dimension and product dimension tables. Adding store name and product name to the table is necessary to fully answer the original question. The resulting table makes it clear that the G2 went on sale at Best Buy and the S6 went on sale at HHGreg. This example is a simplified illustration of how data preparation techniques are often used together to answer typical questions of the data.
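Here is a compact sketch of that multi-step process, aggregate, join, filter, then join to the dimensions, in pandas; the tables and prices are illustrative stand-ins arranged so that the G2 falls below its average price at Best Buy and the S6 at HHGreg.

```python
import pandas as pd

# Illustrative stand-ins for the fact and dimension tables.
tx = pd.DataFrame({"store_id": [101, 102, 101, 102],
                   "product_id": [1001, 1001, 1002, 1002],
                   "price": [400.0, 425.0, 310.0, 290.0]})
store = pd.DataFrame({"store_id": [101, 102], "store_name": ["Best Buy", "HHGreg"]})
product = pd.DataFrame({"product_id": [1001, 1002], "product_name": ["G2", "S6"]})

# Step 1: average price per product.
avg = (tx.groupby("product_id", as_index=False)["price"].mean()
         .rename(columns={"price": "avg_product_price"}))

# Steps 2-4: join the average back, filter to below-average prices, add names.
on_sale = (tx.merge(avg, on="product_id")
             .query("price < avg_product_price")
             .merge(store, on="store_id")
             .merge(product, on="product_id"))
```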
This concludes our video on aggregation. Today we covered a definition of aggregation and examples of aggregation.

This video will cover cross tabs and transposing. Cross tabs are also called pivot tables. They are used to compare one measure across two or more dimensions. One dimension will have its values transformed into columns, and the other dimensions will be aggregated. The visual result will be a table with fewer rows but more columns. For example, this table shows sales by region and month. The data is stored in a form similar to how it would be captured, with one row for every combination of region and month. A cross tab function is performed with region defined as the grouping field, month defined as the header field, and sales defined as the data field. There can be multiple grouping fields, but only one header field and one data field are allowed. Cross tabs are an effective way to summarize and present data, as trends within the data are easier to identify. In data visualization, color can be added to a cross tab to build a heat map. Cross tabs are also used in statistics to build contingency tables. Transposing can be thought of as the opposite of cross tabs. It transforms the data from a wide format into a narrow format. Typically, transposing is required after receiving financial data from a spreadsheet, as columns can be used to capture dates, locations, or categories. Once the columns, known as data fields, are transposed into one variable, the data is ready for a more sophisticated analysis. Any fields that are not to be transposed are called key fields. For this example, we can use the result of the cross tab function from the previous section. Region is the only key field, while January, February, and March are the data fields to be transposed. Transposing results in the original data set. Notice how the column names are transformed into values for the newly created month column.
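In pandas, the cross tab corresponds to a pivot and the transpose step corresponds to a melt; here is a sketch with illustrative sales figures.

```python
import pandas as pd

# Illustrative sales, one row per combination of region and month.
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "month": ["January", "February", "January", "February"],
    "sales": [100, 120, 90, 130],
})

# Cross tab: region is the grouping field, month the header field, sales the data field.
wide = (sales.pivot_table(index="region", columns="month",
                          values="sales", aggfunc="sum")
             .reset_index())

# Transpose (melt): back to one row per region and month; region is the key field.
narrow = wide.melt(id_vars="region", var_name="month", value_name="sales")
```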
This concludes our video on cross tabs and transposing.

This video will cover contingency tables. First we will discuss the definition of a contingency table, and then the steps for creating one. Finally, we will discuss chi-square distributions. In statistics, a contingency table is a type of table in a matrix format that displays the multivariate frequency distribution of variables. Contingency tables are heavily used in survey research, business intelligence, engineering, and scientific research. They provide a basic picture of the interrelation between two or more variables and can help find interactions. A contingency table is also referred to as a two-way frequency table. Here's an example. Given this table, can you calculate the following metrics: the number of males who are right-handed, the percent of males who are left-handed, whether more males are left-handed than females, or the percent of left-handed people who are female? In a table such as this, notice that there are no numerical values. The person ID is a numerical identifier, but the numbers are arbitrary, so there are no obvious calculations to perform. However, because each row represents one person, we can count the number of rows along each dimension, gender and dominant hand, to analyze how the dimensions are distributed. To create a contingency table, the first step is to aggregate the original data set along two dimensions, gender and handedness, and count the number of rows for each combination of values. Since there are two possible values for gender, male and female, and two possible values for dominant hand, left and right, there are four total possible combinations: male and right, male and left, female and right, and female and left. For each of these combinations, we count the number of rows that contain both values. At this point, we can extract information regarding the quantity of each combination, allowing us to answer basic questions such as: how many females are left-handed, and are there more males or females who are left-handed? Next, we perform a cross tab on the result, moving dominant hand to the horizontal axis. This will make interpretation of the results much easier and allow us to easily calculate totals along each dimension. There are other questions, however, that are harder to answer, and these deal with the proportions of each combination. What proportion or percent of males are left-handed, and are a greater proportion of males left-handed than females? For these questions, proportions within gender, we need to divide the values in the cells by the totals along the gender axis, which in this case is the vertical axis. The resulting table, called a row conditional frequency table, will have 100% values in each cell of the total column. The cells for each row will account for all of the people within the male or female category of the gender dimension. The table can be easily read: 83% of males are right-handed, and only 8% of females are left-handed. But what if we want to know what percent of left-handed people are female? This is a different type of question, one that can be answered by dividing the original values of the cells by the totals along the dominant hand, or horizontal, axis of the contingency table. These are called column relative frequencies. Now the results can be interpreted into sentences by focusing on the columns in the contingency table: 49% of right-handed people are male, 31% of left-handed people are female, and so on. But what if we want to answer questions about how prevalent each separate combination is among all people in this data set? For instance, what percent of the people are left-handed males? For that, we divide the original values by the total number of rows in the data set. For our example, this makes for an easy calculation, since there were 100 rows. With this table, we can answer many basic questions about the data set, including what percent of the people in the data set are right-handed, what percent of the people in the data set are female, what percent of the people in the data set are left-handed females, and so forth. Note that the language used, what percent of people in the data set, is different from the easier-to-say what percent of people. This is because the data set being used is a sample of the population, and to infer trends about the population, we'll need to use a statistical technique. For a contingency table, a chi-square distribution can be used to make such an inference. Doing so, however, assumes that the sample you are using was acquired from the population randomly. The chi-square test compares the actual values to the expected values to determine if the actual numbers that were observed and recorded in the data set are due to chance, or if there's a relationship between the two variables that cannot be explained by chance. In this example, we want to determine if the proportion of females who are left-handed is really smaller than the proportion of males who are left-handed, or if the observed difference could be due to chance. In other words, we want to know if dominant hand is dependent on gender. To calculate the expected value for each cell, multiply the corresponding row and column totals and divide by the total number of observations. In this example, to calculate the expected value for right-handed males, multiply 52, the total number of males, by 87, the total number of right-handed people, and divide by 100, the total number of people. Similar calculations are done for each cell. Once the expected values have been computed, the chi-square test can be run. In Excel, the chi-square test function uses the actual table and expected table as inputs to calculate the p-value. This p-value is the probability of observing results at least as extreme as these purely by chance, that is, if the two variables were actually independent. A common way to evaluate the p-value is to compare it to 0.05. If the p-value is less than 0.05, we say there is an association between the two variables that is statistically significant. In this example, our p-value is 0.18, which is greater than 0.05, so we cannot conclude that dominant hand is associated with gender; we treat the dominant hand variable as independent of gender. If there is an association between two variables, completing the calculation for the chi-square test statistic can be used to find which values contribute the most to the association. This can be calculated using this formula: for each cell, subtract the expected value from the observed value and square the result, then divide the answer by the expected value. Each of these results is summed, indicated by the Greek letter sigma. This value is compared to the chi-square distribution. Within each cell, the calculation describes how far the actual value is from the expected value. In this example, the calculations for left-handed people are much greater than those for right-handed people. These cells have the greatest impact on the potential association between the two variables. Summing those values results in a chi-square test statistic of 1.78. Using this value with a chi-square distribution, along with the degrees of freedom based on the number of values for each variable, and a significance level, such as 0.05, is how the p-value is obtained.
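Here is a sketch of the same test in Python using scipy. The counts are reconstructed from the percentages quoted above (43 right-handed and 9 left-handed males; 44 right-handed and 4 left-handed females), so treat them as an approximation of the video's table.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: male, female. Columns: right-handed, left-handed.
observed = np.array([[43, 9],
                     [44, 4]])

# correction=False matches the plain chi-square formula described above.
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2), round(p, 2))  # approximately 1.78 and 0.18
```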
This concludes our video on contingency tables. Today we defined contingency tables, discussed the steps for creating one, and covered chi-square distributions.

This video will first define distribution, then cover the measures of distribution: the mean, median, outliers, mode, minimum and maximum values, and quantiles. The most common method of analyzing a numeric variable is by exploring how the values are distributed. The distribution of a numeric variable shows all the possible values and how often they occur. A distribution provides methods by which many records of data can be summarized to provide basic information about the variable. These methods can either be numerical measures or visualizations. We'll explore the most popular methods using this data set of 10 rows of bank teller salary data. Because the sample data set is so small, we can make some quick observations about the salary variable. The lowest salary is $28,665; this is referred to as the minimum value. The highest salary is $44,020, the maximum value. Finally, many of the salaries are in the low $30,000 range. We'll begin by defining and calculating the salary variable's measures of distribution: mean, median, outliers, mode, minimum and maximum values, and quantiles. To find the mean, or average, add up all the numbers and divide by the number of rows: $341,860 divided by 10 equals a mean of $34,186. To find the median, sort the numbers and find the middle value. If there is an odd number of rows, there is one middle value. If there is an even number of rows, there are two middle values, and the median will be the average of these two values. Here we sort the table by salary, then find the average of the two middle values by adding $33,980 to $34,850 and dividing by 2 to get $34,415. Values that fall outside of the normal range of the rest of the observations are called outliers. In our example, the value of $44,020 is an outlier from the rest of the values. The mode is the most commonly occurring value. In our sample data set, no values occur more than once, so there is no mode. As we've already noticed, the minimum and maximum values are $28,665 and $44,020. A quantile is a set of values that divide a frequency distribution into equal groups, each containing the same fraction of the total population. To find quantiles, divide the distribution into groups of equal size, with each group containing about the same number of rows. The simplest quantile has already been calculated: the median divides the records into two groups of five. The following quantiles are used most often: tertiles, three groups; quartiles, four groups; quintiles, five groups; deciles, ten groups; and percentiles, one hundred groups. Let's look at quartiles for our example. The median is commonly referred to as Q2, since it is the second quartile. Q1 and Q3 are also used often. To calculate quartile one, we can look at the middle value of the first five records sorted in order, which is $31,300, and then look at the middle value of the last five records for quartile three, which is $35,100. These values, along with our median, are our quartile values.
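Here is a sketch of these measures in pandas. The salaries below are illustrative stand-ins chosen to reproduce the video's minimum, maximum, mean, median, Q1, and Q3; the remaining values are hypothetical.

```python
import pandas as pd

# Illustrative 10-row salary sample (sorted for readability).
salary = pd.Series([28665, 29500, 31300, 32500, 33980,
                    34850, 35000, 35100, 36945, 44020])

print(salary.mean())                       # 34186.0
print(salary.median())                     # 34415.0, the average of the middle pair
print(salary.min(), salary.max())          # 28665 and 44020
print(salary.mode())                       # every value ties when nothing repeats
print(salary.quantile([0.25, 0.5, 0.75]))  # quartiles Q1, Q2, Q3
```

Note that quantile interpolates between values, so its Q1 and Q3 can differ slightly from the middle-of-each-half method described above.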
This concludes our video on measures of distribution. Today we defined distribution, and then we discussed the measures of distribution, including mean, median, outliers, mode, minimum and maximum values, and quantiles.

This video will cover variation, including a definition of variation and the measures of variation: range, interquartile range, variance, standard deviation, and standardization, as well as testing differences between means. Variation refers to how spread out the values are for a variable. Interpreting variation, that is, explaining a variable's variation in reference to other variables, is a foundational task in business analytics. Variables with values that are spread out have higher variation, while variables with values very close to the mean have lower variation. The following measures help us understand how the values for a variable are spread out around the mean: range, interquartile range, variance, standard deviation, and standardization. Using an example of salaries by gender, a question arises: do men make more than women? Not just whether some men make more than some women, but whether this is true for the groups as a whole. Using the principles of variation will help us answer this question. The range is the largest number minus the smallest number. To find the range, we first sort from smallest to largest; then we subtract the smallest value from the largest value. To find the interquartile range, first separate the data into quartiles and then subtract Q1 from Q3. To find the variance, we first calculate the mean, or average. Then we subtract the mean from each value and square the result. By squaring the differences, we remove the possibility of negative values canceling out the positive values. Next, find the average of the squared differences. Variance is hard to interpret on its own: because it is expressed in squared units, the value we got for variance is on a much different scale from the range of salaries that our variable captures. To get it back into the correct scale, we take the square root of the variance. This results in the standard deviation. This value is used relative to the mean to determine which values are within the normal variation of the salary variable. Subtract the standard deviation from the mean to find the lower threshold, and add the standard deviation to the mean to find the upper threshold. Using the standard deviation gives us a way to determine which values are within the normal variation and which values are either less than normal or greater than normal. In other words, it gives us the ability to identify outliers more easily. In our example, we can identify that the smallest two salaries, $28,665 and $29,500, and the largest salary, $44,020, are outside one standard deviation from the mean. Mean and standard deviation are measures that describe the distribution of a set of numbers. But what if we want to describe the numbers within the set so we can compare them to numbers within an entirely different set? We can do this by standardizing the values. In statistics, this measure is called the Z-score. To standardize a value, subtract the mean and divide by the standard deviation. For example, in the first row of our data set, since we've already calculated the salary-minus-mean column, all we need to do is divide by 4,103 to get the result of negative 1.14. This value can be interpreted as: the value of $29,500 is 1.14 standard deviations lower than the mean. Looking at the rest of the values, we see that the largest salary, $44,020, is actually very far away from the mean. This must be a high-performing bank teller. We can use this standardized value to compare the high-performing bank teller to the high performer of another profession.
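Here is a sketch of standardization in pandas, reusing the illustrative salaries from the previous sketch; because those values are stand-ins, the computed standard deviation will not exactly match the video's 4,103.

```python
import pandas as pd

# Illustrative salaries from the earlier sketch.
salary = pd.Series([28665, 29500, 31300, 32500, 33980,
                    34850, 35000, 35100, 36945, 44020])

mean = salary.mean()
std = salary.std()            # sample standard deviation
z = (salary - mean) / std     # Z-score: distance from the mean in standard deviations

# Values with an absolute Z-score above 1 lie outside one standard deviation.
print(z.round(2))
```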
Let's return to the question posed at the beginning: do men make more than women? To answer this question, we can use a statistical test called a t-test. This test requires that we make three assumptions about our data, which for now we will assume to be true: the two populations have the same variance, the populations are normally distributed, and each value is sampled independently from each other value. By considering the values of salary separately, we can see that females have both the lowest value and the highest value, so their variation must be higher. Calculating the mean and standard deviation for each group, we see that the means are almost the same, yet the standard deviation for females is more than twice that for males. Using these values, we can calculate the t-statistic and corresponding p-value, but in Excel, we only need the original values in different columns. Using the t-test function, we get a p-value of .44. Since this value is much higher than .05, we conclude that the difference between the two group means is not statistically significant. This is mainly due to the fact that our sample size for this example is very small, and also because the means are almost equivalent. This concludes our video on measures of variation. Today we defined variation and then discussed range, interquartile range, variance, standard deviation, standardization, and testing differences between means.

This video will cover distribution visualizations, including buckets, histograms, and an introduction to area line graphs. We begin with a data set that contains gender, dominant hand, and salary. We have calculated a number of different measures, including the mean, median, minimum, maximum, range, interquartile range, variance, standard deviation, and a list of Z-scores. Next, we will build a histogram, which is the most common way to visualize a numeric distribution. To build the histogram, we will group the data into buckets, or bins, of equal size. This will effectively transform the variable into a categorical variable, allowing us to count the occurrences in each bucket. Next, we determine the size of each bucket. First, we locate our minimum and maximum values. Then we subtract the minimum from the maximum and divide by the number of buckets. This gives us a value of $1,536. Next, we need to figure out the starting value for each bucket. Starting with the lowest salary, we add the size of the bucket to determine the starting value of the second bucket. We do this until we obtain the starting values for all 10 buckets. We'll add a new column to the original table to assign each person to a salary bucket. Then we summarize the new variable, starting value of salary bucket, counting how many occurrences of each bucket are observed. Note that although this variable is numeric, by summarizing it, we're treating it as a categorical variable. Finally, we'll make a simple bar chart to visualize this table. This bar chart is known as a histogram, and it's one of the most common methods of visualizing a numeric distribution. Although many software packages can produce a histogram very quickly, as we've seen, creating one involves quite a few steps.
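Here is a sketch of those manual bucketing steps, again reusing the illustrative salaries; plotting the resulting counts as a bar chart yields the histogram.

```python
import numpy as np
import pandas as pd

# Illustrative salaries from the earlier sketches.
salary = pd.Series([28665, 29500, 31300, 32500, 33980,
                    34850, 35000, 35100, 36945, 44020])

n_buckets = 10
size = (salary.max() - salary.min()) / n_buckets        # bucket size
edges = salary.min() + size * np.arange(n_buckets + 1)  # starting value of each bucket

# Assign each salary to a bucket, then count occurrences per bucket.
buckets = pd.cut(salary, bins=edges, include_lowest=True)
counts = buckets.value_counts().sort_index()
print(counts)
```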
That's true for all normal curves, no matter their shape. But how good is this rule for real data? Let's go ahead and check out an example. Here's our data. The mean of the weight of 120 women runners in a sample is 127.8 pounds. The standard deviation is 15.5. Here's what our distribution would look like. Let's look a little more closely at that distribution. 68% of our 120 runners is about 83 runners. According to the 68-95-99.7 rule, those runners should fall within one standard deviation of the mean weight of 127.8 pounds. That is, 83 of our runners should fall between 112.3 and 143.3 pounds. When we check our data, we see that 79 runners fall within one standard deviation of the mean. Furthermore, 95% of our group, or about 114 runners, should fall within two standard deviations of the mean, or between 96.8 and 158.8 pounds. The data shows that 115 runners fall within two standard deviations of the mean. Finally, according to the rule, 99.7% of our runners, or 119.6 runners, should fall within three standard deviations of the mean, or within a range of 81.3 pounds to 174.3 pounds. According to our data, all 120 runners fall within this range. So it seems as if the rule is pretty accurate in this case. This concludes our video on normal distributions, continuous distributions, density functions, and cumulative distribution functions, as well as the 68-95-99.7 rule.

This video will cover kurtosis, including a definition as well as positive and negative kurtosis, and asymmetrical distributions, including those of positive and negative skew. Kurtosis is the measure that describes the size of the tails in a distribution. A distribution with positive kurtosis contains more values in the tails than a normal distribution. A distribution with fewer values in the tails has negative kurtosis. The normal distribution is a type of symmetrical distribution, in which the mean is equal to the median, and there is an equal probability of a value falling on either side of the mean. Although normal distributions and other types of symmetrical distributions are very common, there are often distributions that are asymmetrical. We call these types of distributions skewed. For a negatively skewed distribution, the left tail is longer and the mean is less than the median. This is due to the occurrence of outliers at the lower end of the distribution, away from the most frequently occurring values. A good example of this is the height of NBA players. Since taller basketball players have an advantage, the majority of NBA players are very tall, and the smaller number of shorter players forms a long left tail. For a positively skewed distribution, the right tail is longer, as there are outliers with larger values. These outliers cause the mean to be greater than the median. An example of a positively skewed distribution is the salary of baseball players. There are a few star players who make much more than the majority of players. These high salaries are outliers that increase the mean of the distribution. For distributions that are skewed either positively or negatively, the median is a better representation of central tendency than the mean, as the outliers will impact the calculation of the mean. The median is not affected by outliers and typically is a much better approximation of the middle of a distribution. This concludes our video on kurtosis. Today we defined kurtosis and discussed positive and negative kurtosis. We also covered asymmetrical distributions, including those of positive and negative skew.
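To make these measures concrete, here is a minimal sketch using scipy (the salary figures are made up for illustration). Note that scipy.stats.kurtosis reports excess kurtosis, so a normal distribution scores zero, heavier tails score positive, and lighter tails score negative.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Illustrative positively skewed data: a few star salaries pull the mean up.
salaries = np.array([40, 42, 45, 47, 50, 52, 55, 60, 250, 400])  # in thousands

print("mean:  ", np.mean(salaries))    # pulled upward by the outliers
print("median:", np.median(salaries))  # resistant to the outliers
print("skew:  ", skew(salaries))       # positive: longer right tail
print("excess kurtosis:", kurtosis(salaries))  # positive: heavier tails than normal
```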
This video will cover sampling basics, including populations and inferences, selecting a sample, random sampling, stratified sampling, and cluster sampling. A population is a set of all members about which a study intends to make inferences. Here's a population of people. We'd like to study their television watching behavior and infer how many watch a particular show so we can decide whether to purchase advertising spots during this period of time. But our population is much too large to study effectively. To study their television viewing habits, we'll have to survey them, and it's not feasible to survey every single individual. So we'll take a sample of the population to study. The sample will represent the population as a whole, and so will its survey results if we choose our sample correctly. We will focus on selecting a sample using probability. A probability sample is chosen from a population using a random mechanism. The main types of probability sampling are simple random, stratified, and cluster sampling. A random sample is only random if each individual has the same chance of being chosen from the population. So back to our television viewing research. We want to figure out how many television viewers there might be during a particular show. Let's say there are 30,000 viewers in our population. Each viewer is known as a unit. In order to select a sample of n viewers from this population of 30,000, we could choose to use a simple random sample. This means that there is an equal probability that each viewer could be selected for inclusion in the sample. If our desired sample size was 200 viewers, we would find a way to select 200 viewers randomly, and then we could send each of those viewers a questionnaire in the mail about their viewing habits. But suppose various subpopulations within the total population can be identified. These subpopulations are called strata. Instead of taking a random sample from the entire population, we might get better information by selecting a simple random sample from each stratum separately. This is called stratified sampling. Examples of subpopulations in television viewers might include age or gender groups. There are several advantages to stratified sampling. One obvious advantage is that separate estimates can be obtained within each stratum, which would not be obtained with a single random sample from the entire population. For example, let's say we're looking at our television viewers by age group. We have three strata: 18 to 24, 25 to 39, and 40 plus. We find that their peak viewing time varies based on age, thus we can make better decisions about which product to advertise during which time period. A more important advantage of stratified sampling is that the accuracy of the resulting population estimates can be increased by using appropriately defined strata. In cluster sampling, the population is separated into clusters, such as regions of the country, and then a random sample of the clusters is selected. The primary advantage of cluster sampling is sampling convenience and possibly lower cost. Selecting a cluster sample is straightforward. The key is to define the sampling units as the clusters, such as the regions of the continental US shown here. This concludes our video on sampling basics. Today we covered populations and inferences, selecting a sample, random sampling, stratified sampling, and cluster sampling.
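Before moving on, here is a minimal sketch of the three sampling schemes just described (the population, strata boundaries, and sample sizes are all simulated for illustration):

```python
import random

random.seed(1)
population = list(range(30000))          # 30,000 viewers, each a unit

# Simple random sample: every viewer has the same chance of selection.
srs = random.sample(population, 200)

# Stratified sampling: a simple random sample from each stratum separately.
strata = {"18-24": population[:6000],
          "25-39": population[6000:16000],
          "40+":   population[16000:]}
stratified = {name: random.sample(units, 67) for name, units in strata.items()}

# Cluster sampling: define the sampling units as clusters, then sample clusters.
clusters = [population[i:i + 3000] for i in range(0, 30000, 3000)]  # 10 regions
sampled_clusters = random.sample(clusters, 2)
```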
This video will cover bivariate data, scatter plots, and null values.

Measures of central tendency, variability, and spread summarize a single variable by providing important information about its distribution. Often, more than one variable is collected on each individual. For example, in large health studies of populations, it's common to obtain variables such as age, sex, height, weight, blood pressure, and total cholesterol on each individual. Economic studies may be interested in, among other things, personal income and years of education. As a third example, most university admissions committees ask for an applicant's high school grade point average and standardized admission test scores like the SAT. Bivariate data consists of two quantitative variables for each individual, in contrast with univariate or single-variable data. Our first interest is in summarizing such data in a way that's analogous to summarizing univariate data. By way of illustration, let's consider something with which we're all familiar, age. Let's begin by asking if people tend to marry other people of about the same age. Our experience tells us yes, but how good is the correspondence? One way to address the question is to look at pairs of ages for a sample of married couples. Table one shows the ages of 10 married couples. Going across the columns, we see that yes, husbands and wives tend to be of about the same age, with men having a tendency to be slightly older than their wives. This is no big surprise, but at least the data bear out our experiences, which is not always the case. The pairs of ages in table one are from a data set consisting of 282 pairs of spousal ages, too many to make sense of from a table. What we need is a way to summarize the 282 pairs of ages. We know that each variable can be summarized by a histogram, which is a graphical representation of a distribution. A histogram partitions the variable on the x-axis into various contiguous class intervals of usually equal widths. The heights of the bars represent the class frequencies. Here we can see that each distribution is fairly skewed with a long right tail. We can also summarize the variables with a mean and standard deviation. From table one, we can see that not all husbands are older than their wives, and it's important to see that this fact is lost when we separate the variables. That is, even though we provide summary statistics on each variable, the pairing within the couple is lost by separating the variables. We cannot say, for example, based on means alone, what percentage of couples has younger husbands than wives. We have to count across the pairs to find this out. Only by maintaining the pairing can meaningful answers be found about the couples. Another example of information not available from the separate descriptions of husbands' and wives' ages is the mean age of husbands with wives of a certain age. For instance, what is the average age of husbands with 45-year-old wives? Finally, we don't know the relationship between the husband's age and the wife's age. We can learn much more by displaying the bivariate data in a graphical form that maintains the pairing. Figure two shows a scatter plot of the paired ages. The x-axis represents the age of the husband and the y-axis the age of the wife. There are two important characteristics of the data revealed by Figure two. First, it's clear that there's a strong relationship between the husband's age and the wife's age. The older the husband, the older the wife.
When one variable y increases with the second variable x, we say that x and y have a positive association. Conversely, when y decreases as x increases, we say that they have a negative association. Second, the points cluster along a straight line. When this occurs, the relationship is called a linear relationship. Figure three shows a scatter plot of arm strength and grip strength from 149 individuals working in physically demanding jobs, including electricians, construction and maintenance workers, and auto mechanics. Not surprisingly, the stronger someone's grip, the stronger their arm tends to be. There is therefore a positive association between these variables. Although the points cluster along a line, they're not clustered quite as closely as they are for the scatter plot of spousal age. A common problem when working with real-world data is the presence of missing or null values within a data set. There are three strategies to deal with the issue. The first one is to omit the rows. If the variable is very important to the analysis and there are not many observations with missing values, it can be acceptable to filter or delete those rows. The second is to treat missing as a separate category. If the variable is categorical, this is easy. If the variable is numeric, then the variable will need to be binned and a category created for the missing rows. The third is to impute a value using distribution measures such as the mean or the median, optionally calculated within groups defined by other variables. If the values of other fields have differing distributions for the variable with missing values, we can calculate separate distribution measures within those categories. This concludes our video on bivariate data, scatter plots, and null values.

This video will cover uncertainty, entropy, and analyzing data. The result of data analysis is information. Information resolves uncertainty. The uncertainty of an event is measured by its probability of occurrence. The more uncertain an event, the more information is required to resolve the uncertainty of that event. Entropy refers to the fact that you cannot stir things apart. It's a measure of information content and unpredictability. Here's a concrete example of entropy. If you have cold water and hot water and mix them together, you'll have warm water. You can't separate the cold and the hot after they're mixed. This is what is meant by you cannot stir things apart. To get an informal, intuitive understanding of the connection between these terms, consider the example of a poll on some political issue. The outcome of the poll is relatively unpredictable, and actually performing the poll and learning the results gives some new information. These are just different ways of saying that the entropy of the poll's results is large. Now let's say a second poll is performed shortly after the first poll. Since the result of the first poll is already known, the outcome of the second poll can be predicted well, and the results should not contain much new information. In this case, the entropy of the second poll result is small relative to the first. Now consider the example of a coin toss. When the coin is fair, that is, when the probability of heads is the same as the probability of tails, then the entropy of the coin toss is as high as it can be. That's because there's no way to predict the outcome of the coin toss ahead of time.
Such a coin toss has one bit of entropy, since there are two possible outcomes that occur with equal probability, and learning the actual outcome contains one bit of information. By contrast, a coin toss with a coin that has two heads and no tails has zero entropy, since the coin will always come up heads, and the outcome can be predicted perfectly. In this graph, entropy is maximized when the probability is 50%. When the probability is zero or 100%, there is zero entropy. By adding information, we can reduce entropy and gain certainty. How does entropy apply to analyzing data? We can use the principles of entropy to decide what results are important and should be included in a report, and what results are trivial and should not be included. If there is no uncertainty for a variable, then there's no information. If we obtained a new observation, we would already know the value of that variable. Typically, variables with only one value are excluded from an analysis at the very beginning. On the other hand, if a variable has maximum uncertainty because it contains values that are equally likely, it will be difficult to guess the value for a new observation. These variables are not excluded from analysis. Applying this concept to data analysis, our goal is to reduce entropy. By explaining outcomes using other variables, we are reducing uncertainty, and these results should be the focus of a report. This concludes our video. Today we covered uncertainty, entropy, and analyzing data.

This video will cover the parts of an analytical report, including the introduction, data, analysis, and results and conclusion. The introduction should provide a concise summary of the project, including the problem faced, the type of data gathered, and the highlights of the solution. The data section should go into detail about the data used to complete the project. Variables and other technical terms should be defined well. An example value should be listed and interpreted. Also, this section should mention any abnormalities in the data, such as missing values, and discuss the steps that were taken to clean and prepare the data for analysis. In the analysis section, the report should cover the thought process behind the analysis, including any output and data visualizations that are pertinent. Methods that are used should be introduced, along with a brief description of the reasons they were used, possibly including references to external sources for further study. Care should be taken not to include every possible analysis, as this can overload the audience with information. Insignificant or less significant findings can be briefly summarized, leaving the majority of content in this section to focus on the most important findings. The results and conclusions section should summarize the results of the analysis, and if applicable, provide specific recommendations on a course of action. Because the analysis section covered most of the information gleaned from the data, the results and conclusions should mostly just apply that information towards a goal or a further course of study. Combined, the introduction, data, analysis, and results and conclusions sections of an analytical report, delivered succinctly and clearly by an effective business analyst, can provide useful and pertinent information to drive business improvement. This concludes our video on writing analytical reports. Today we covered the parts of a report, including the introduction, data, analysis, and results and conclusion.
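As a companion to the entropy segment above, here is a minimal sketch of the standard Shannon entropy formula for a binary outcome, which the video describes only through its graph:

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary event with probability p of success."""
    if p in (0.0, 1.0):
        return 0.0  # no uncertainty: the outcome is fully predictable
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(binary_entropy(0.5))  # 1.0 bit: a fair coin is maximally uncertain
print(binary_entropy(0.9))  # ~0.47 bits: a biased coin is easier to predict
print(binary_entropy(1.0))  # 0.0 bits: a two-headed coin carries no information
```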
This video will cover automation, including a brief introduction to automation, macros, and stored procedures. A report is not the only result of a business analytics project. Automation will often be a separate goal, especially when a data source is frequently updated with new records. The main benefit of automation is that it frees up the analyst's time to work on different problems. This is very valuable from a business perspective, as analysts who are skilled in automating tasks can save the company money in terms of labor expenses, and over time can uncover more and more patterns in the data, leading to greater profitability. There are two main types of automation used in business analytics: macros and stored procedures. Macros are also referred to as functions. The purpose of a macro is to easily replicate certain steps without having to write out those steps individually. Macros can make an analysis quicker and more concise. For instance, with a certain dataset, you may want to filter, sort, and then take the average of a field. If this set of steps will be used multiple times or for different datasets, you may want to transform these steps into a macro to save time. The parameters of a macro are its inputs. These can be datasets, fields, or values. The parameters are the components that change with each call to the macro. After the steps of the macro are complete, the output will be returned, and like the parameters, it can be of different types. Macros are the foundation of many analytics software packages, as many of the complex algorithms used in data science are mostly just combinations of simpler functions. By utilizing macros that have already been built, an analyst can become more efficient. By building customized macros, an analyst can easily pass on their efforts to other analysts. Most companies with a data science team will build out a repository of customized macros called a code base. A code base makes it easier to manage certain tasks, such as committing changes and version control. These concepts are taken from computer science best practices for developing software. Another concept that comes from computer science is the use of object-oriented programming. This principle guides the development of macros and is based on the idea that the components of code should be compartmentalized. The main benefit of this technique is that it makes large projects easier to develop and maintain. Instead of an analyst having to replicate an analysis manually every time new data is captured, a stored procedure will execute it automatically. The stored procedure is a type of algorithm that runs according to a schedule and executes a series of steps that the analyst sets up in advance. Usually these steps will be in the form of code, but newer software programs allow analysts to build stored procedures without having to write code. Once code has been productionalized, only monitoring the execution of the stored procedure is necessary. When the procedure completes with an error, the analyst will need to troubleshoot the code. The most common errors arise from unforeseen data values. That is, variables will contain value types that were not present in the original dataset. For example, a numeric field might contain an alphabetic character. The best practice is to test for these values within the stored procedure and either transform them or remove the records altogether and add a warning message. This will prevent the stored procedure from failing altogether.
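That defensive step might look something like the following minimal sketch (the field name, logger name, and record format are illustrative, not from any particular tool):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("nightly_sales_load")  # hypothetical procedure name

def clean_amount(raw_rows):
    """Keep rows whose 'amount' field is numeric; warn about the rest."""
    clean, dropped = [], 0
    for row in raw_rows:
        try:
            row["amount"] = float(row["amount"])
            clean.append(row)
        except (TypeError, ValueError):
            dropped += 1  # e.g., an alphabetic character in a numeric field
    if dropped:
        log.warning("Dropped %d records with non-numeric amounts", dropped)
    return clean

# Example call with one bad record that gets dropped with a warning.
rows = clean_amount([{"amount": "19.99"}, {"amount": "N/A"}])
```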
Logs are output from stored procedures to describe the processes that ran. The information that logs produce can vary, but it usually includes the amount of time each process took, the number of records that were input, and any error or warning messages that were triggered. Analysts should add code to stored procedures to make logs more descriptive and thus easier to troubleshoot when problems arise. Depending on the tools being used and the amount of data being processed, productionalizing an algorithm can be either a simple task for one analyst or a multi-year project involving many analysts, project managers, database administrators, documentation writers, and quality assurance specialists. In these cases, management methodologies such as agile software development are used to coordinate team members and progress through the project lifecycle. This concludes our video on automation. Today we covered a brief introduction to automation, macros, and stored procedures.

This video on regression will cover simple linear regression analysis, regression line fitting, observed and predicted values, least squares coefficient estimation, goodness of fit, explained and unexplained variation, the root mean square error and the coefficient of determination, significance testing, and regression assumptions. Regression analysis is used to predict the value of one variable, the dependent variable Y, on the basis of another variable, the independent variable X. In other words, if you know something about X, you can use it to predict something about Y. We provide the independent variable X and we observe the dependent variable Y. The linear regression equation is shown here: Y equals beta zero plus beta one times X, plus epsilon. The variable X is considered the independent or predictor variable. The variable Y is the dependent or outcome variable. We have data on both X and Y. We use this information to estimate the value of the intercept, beta zero, and the slope, beta one, that relate X and Y. Since the linear relationship is not exact, we include an error term in the model, epsilon. Why is this useful? Because once we have estimates of beta zero and beta one from our regression, we can use them for any value of X to predict what the value of Y would be. So if we have data on NBA players' weight and height, we can estimate how much a typical NBA player weighs based on his height. Then if someone wants to join the NBA and we know what his height is, we can estimate what weight he should be to make it to the NBA. In linear regression analysis, our goal is to estimate a pattern, in this case a line, that best fits the data. The best fit for our data will go through the core of our data and minimize error. In algebra, a linear relationship is an exact line represented by its slope and intercept. But instead of an exact relationship, regression analysis estimates the line from data. Since the line does not fit the data points precisely, there is an error term, e i, measuring the deviation of the actual Y from the estimated Y. Using least squares coefficient estimation, we can obtain the line that best fits the data by minimizing the sum of squared errors. The total variance in Y is divided into two parts, that which can be explained by X using regression and that which cannot. No line is perfect. There's always some error in the estimation. Unless there's an exact dependency between the predictor and the response, there's always some part of the response Y that can't be explained by the predictor X.
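To make this concrete, here is a minimal sketch of least squares fitting on made-up height and weight data (numpy's polyfit returns the slope and intercept that minimize the sum of squared errors):

```python
import numpy as np

# Made-up height/weight pairs for illustration.
height = np.array([70, 72, 74, 75, 76, 78, 80, 82])          # inches
weight = np.array([180, 190, 205, 215, 220, 235, 245, 260])  # pounds

# np.polyfit(x, y, 1) returns the slope (beta one) and intercept (beta zero)
# of the least squares line.
beta1, beta0 = np.polyfit(height, weight, 1)

predicted = beta0 + beta1 * height
errors = weight - predicted        # e_i: actual Y minus estimated Y
sse = np.sum(errors ** 2)          # the quantity least squares minimizes

# Predicting for a new value of X: estimated weight for a 77-inch player.
print(beta0 + beta1 * 77)
```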
Using the mean value of Y as our reference point, we can decompose the total variation into the part that is explained by the regression line and the part that remains unexplained. Here's a look at the decomposition. The sum of squares regression, or SSR, is the explained variation attributable to the linear relationship between X and Y. The sum of squares error, or SSE, measures the variation attributable to factors other than the linear relationship between X and Y. The SST is the total sum of squares. The SSR and SSE together make up the SST. So given that the total variation SST is the sum of explained and unexplained variation, we can divide through by the total sum of squares SST to get ratios that sum to one. The ratio of the error sum of squares over the total sum of squares, plus the ratio of the regression sum of squares over the total sum of squares, equals one. Either of these two ratios can be used to measure our model fit. The error sum of squares ratio is the fraction of unexplained variation, and it is closely related to the mean square error and the root mean square error. We want to choose models with the lowest error. The regression sum of squares ratio is called R squared, or the coefficient of determination. We choose the model with the highest R squared. Because the two ratios sum to one, R squared, the coefficient of determination, has to lie between zero and one. Likewise, we can look at the coefficients of the intercept and slope to see if they're significantly different from zero. This is done by examining the t-statistic and the corresponding p-values. P-values less than 0.05 imply that the coefficient is significantly different from zero. There are some key assumptions to check before you apply regression techniques. First, the variables have to have a linear relationship. In this example, the relationship between X and Y is not linear, so we cannot fit a linear regression line. The variables also have to be approximately normally distributed. In this example, Y is not normally distributed at each value of X, so it doesn't make sense to fit a linear regression line. Additionally, the variance of Y at each value of X should be the same, or in other words, we should have homogeneity of variances. The fourth and final assumption is that the observations are independent. Here, one trend line is not sufficient. In other words, if sales today depend on sales yesterday, then linear regression models will not work. This concludes our video on regression. Today we covered simple linear regression analysis, regression line fitting, observed and predicted values, least squares coefficient estimation, goodness of fit, explained and unexplained variation, the root mean square error and coefficient of determination, significance testing, and regression assumptions.

This lesson covers the t-distribution and how it compares to the normal distribution, as well as a brief look at the student's t-distribution. For large samples, the normal distribution applies. For small samples, the standard deviation is measured imprecisely, and the data follow the t-distribution. A t-distribution will approach a normal distribution for a larger n, greater than or equal to 100, but it has fatter tails for a smaller n, less than 100. The t-distribution is very similar to the standard normal distribution. It also has a bell curve, but the standard deviations are computed from the sample data instead of the population. Suppose a simple random sample of size n is drawn from a population whose distribution can be approximated by a normal model with mean mu and standard deviation sigma.
When the standard deviation is known, the sampling model for the sample mean x bar is a normal distribution with mean mu and standard deviation sigma divided by the square root of n. When the standard deviation is estimated from the sample standard deviation s, the sampling model follows a t-distribution with n minus one degrees of freedom. This gives the one-sample t statistic: t equals the sample mean minus mu, divided by s over the square root of n. In this figure, both distributions have zero means, but the variances are a bit different. The t-distribution has a lower peak and fatter tails. This concludes our video on t-distributions. Today, we covered the t-distribution and showed how it is very similar to the normal distribution. We also briefly showed the student's t-distribution.

In this video, we will cover logistic regression, including the need for logistic regression, the logistic regression model, and odds ratios and prediction. In many instances, when you're testing hypotheses and making predictions, you will have dichotomous outcomes. For example, in a game, you can either win or lose. On a website, a user either clicks or does not click. In an election, a person votes for a candidate or does not. When the outcome variable is categorical, such as in our game example, it does not follow a normal distribution. The quantity we model is a probability, measured between zero and one, so the estimates we make should be numbers in that range. A linear model cannot be applied. We need a nonlinear function. There are many nonlinear models we can choose from to fit our data. Some nonlinear functions are shown here. The last one shown is the logistic function. It will fit the data the best because it has an upper and a lower bound. Here's a logistic regression model in orange versus a linear regression model in blue. Isn't it a better fit for the data? To best suit our data, we want a model that predicts probabilities between zero and one, so it will be S-shaped. There are lots of S-shaped curves, but the logistic regression model is what we'll use in this instance. The logistic function is a nonlinear function of the independent variables. However, we can convert this nonlinear function into a linear relationship using the log of the odds ratio. Note that instead of modeling just zeros and ones, we're modeling the probability of an event occurring. With the logistic regression model, instead of winning or losing, we build a model for the log odds of winning or losing. It's the natural logarithm of the odds of the outcome. P stands for the probability of the outcome, while one minus P stands for the probability of not getting the outcome. However, having log of P over one minus P on the y-axis is not very helpful. We have to compute the actual odds. To do that, we have to use the exponential function. Let's look at a very simple example of a logit function. Does alcohol drinking predict political party? Political party is the outcome variable, and it is binary. Therefore, we need a logistic regression. A typical logit equation contains the log of the odds ratio as the outcome, which is a linear function of the predictor x. The logit model to measure the impact of drinking on voter choice is going to be set up as follows. The log of the odds ratio will be measured from data on x, where x here is the number of drinks per week. It's really important to understand that negative 1.4 is measuring the log of the odds ratio. In other words, it is the log of the probability of being Republican divided by the probability of not being a Republican.
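In code, converting the log of the odds into actual odds, and odds into a probability, is a single exponentiation. A minimal sketch using the coefficient from this example, and following the video's interpretation of it:

```python
import math

log_odds = -1.4                  # log of the odds ratio from the example
odds = math.exp(log_odds)        # about 0.25: odds of being Republican
probability = odds / (1 + odds)  # about 0.20: the logistic function's output

print(odds, probability)
```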
To get the actual odds ratio, you have to compute the exponent, which is equal to 0.25. Since the odds are less than one, it tells us that the more you drink, the lower your odds of being Republican. All of these calculations can be done automatically in SAS, but it's important to understand the math behind what SAS is doing. The same model can be extended to more than one variable. We just add more predictors to the equation. The coefficient beta measures the impact of x on the log of the odds ratio. For example, in linear regression, if y equals two plus three x, a one unit increase in x will increase y by three units. In a logistic regression, log of p over one minus p equals two plus three x shows us that if x increases by one unit, then the log odds of p, the probability that y equals one, increases by three units. The impact on the odds ratio is represented by e raised to the power beta. We can also compute probabilities directly. To compute odds, we have to use the exponent. If we don't want to look at the odds, but the actual probabilities, we apply the entire logistic function formula, as shown in this probability function. These are the odds ratios and logs of odds ratios for various probabilities. An important point to understand is how the odds ratios are tied to the probabilities. Note the mathematical equivalencies. A 50% probability, or probability of 0.5, is the same as one to one odds. The log of the odds ratio at that point is equal to zero. As probability increases, the odds ratio increases from zero toward infinity, while the log of the odds ratio can take any value. This concludes our video on logistic regression. Today we covered the need for logistic regression, the logistic regression model, and odds ratios and prediction.

This video will cover two types of statistical error, type one or alpha, and type two or beta. All statistics derived from samples are subject to error. A type one error rejects the null hypothesis when it is actually true. A type two error accepts the null hypothesis when it is not true. Remember that a different sample can give a completely different result. A sample mean is likely to fall in the confidence interval only 95% of the time, so the inferences drawn from the sample may be wrong. Let's talk in a little more detail about the type one error. The type one error occurs when a researcher thinks he or she has found a significant result, but really that result is due to chance. It's similar to a false positive on a drug test. The type one error, or the mistake of rejecting a true null hypothesis, will happen with a frequency of alpha. Thus if alpha, our significance level, is 0.05, then a type one error will occur 5% of the time. On the other hand, a type two error occurs when results seem insignificant, but in fact there was something significant going on. Type two errors are like a false negative on a drug test. They occur when the alternative hypothesis is true, but there's not enough evidence in the sample to reject the null hypothesis. This type of error is traditionally considered less important than a type one error, but it can lead to serious consequences in real situations. The power of a test is one minus the probability of a type two error. It is the probability of rejecting the null hypothesis when the alternative hypothesis is true. In these competing sampling distributions, alpha is set to 0.05. The bottom curve assumes HA is true. The top curve assumes that the null hypothesis H0 is true. Its right tail shows that we will reject H0 when a sample mean exceeds 189.6.
The probability of getting a value greater than 189.6 on the bottom curve is 0.5160, corresponding to the power of the test. Here's a table that summarizes the types of errors. Here's an example using a fire alarm, where the null hypothesis is that there is no fire. If the alarm is silent and there's no fire, the null hypothesis is correctly accepted. But what if the alarm stays silent when there actually is a fire? Then we've accepted the null hypothesis when it is false. That's our type two error. The opposite case may also happen. If the alarm goes off and there's actually a fire, there's no error. But if there's no fire and the alarm goes off, it's a false alarm, a rejection of a true null hypothesis. That's the type one error. Here, it is the less serious problem. This concludes our video on statistical error. Today we discussed type one or alpha error and type two or beta error.

This video will cover hypothesis testing, which is also called significance testing, and occurs when we test a claim about a population parameter using sample evidence that confirms or rejects the claim. There are four steps in the hypothesis testing process, all of which will be covered in this video. Here's a summary of the four steps in hypothesis testing. After this, we'll discuss each step in detail. The first step is stating the null and alternative hypotheses. We have to establish what we are testing to be true. Once we do that, we have to decide how close to the claimed value our sample statistic has to be for us to accept the claim. For example, we might want our estimate to be accurate with a 5% margin of error. This is called locating the critical region. Once we know that, we have to compute the test statistic, the Z value or the T value. Finally, based on our results, we draw conclusions from the study. The first step in the procedure is to convert the research question into a statement of the hypothesis in null and alternative forms. Our study will be to collect data and seek evidence against the null hypothesis as a way of deductively bolstering the alternative hypothesis. The null hypothesis, abbreviated H naught, is a statement of no difference. In other words, the null hypothesis argues that there is no significant difference between our specified populations and that any observed difference is due to sampling or experimental error. The alternative hypothesis, or HA, is the opposite of the null hypothesis. It provides a statement of difference. In our study, we will seek evidence against the claim of H naught as a way of supporting HA. Here's an example of setting up the null and alternative hypotheses. In the late 1970s, the weight of US men between 20 and 29 years of age had a log normal distribution with a mean of 170 pounds and a standard deviation of 40 pounds. To illustrate the hypothesis testing procedure, we ask if body weight in this group has changed since 1970. This is called our research question, and it can be answered in one of two ways. Under the null hypothesis, there is no difference in the mean body weight between then and now, in which case mu would still equal 170 pounds. Under the alternative hypothesis, we assert that the mean weight has changed, and mu is not equal to 170 pounds. This is called a two-sided test, the most common form of hypothesis testing. We can also do a one-sided test in which we ask if weight has increased over time, so the alternative hypothesis would be that mu is greater than 170 pounds. In step two, we will locate the critical region. Once we've established the research question, we have to define the level of accuracy with which we want to measure our test statistic.
Any estimate from a sample will not be exactly the same as the population parameter, so we have to decide what we think is likely versus unlikely. This is called locating the critical region. The critical region consists of outcomes that are very unlikely to occur if the null hypothesis is true, or in other words, sample means that are almost impossible to obtain by chance alone. When we're estimating population parameters using a sample, we have to determine the cutoff values. The probability that defines these cutoffs is called alpha. If we decide that we want to measure the mean with a 90% precision level, then the shaded area on the left and right will be larger. If we want to measure with 1% precision, then the area will be smaller and the range will be larger. These are the locations of the critical region boundaries for three different levels of significance: alpha equals 0.05, alpha equals 0.01, and alpha equals 0.001. Note that the boundaries get wider as alpha falls. In most cases, researchers choose an alpha of 0.05 or 0.01. Our rejection region should have a probability of alpha if the null hypothesis is true, but some bigger probability if the alternative hypothesis is true. So if the sample mean lies inside the cutoff values for alpha, we fail to reject the null hypothesis. Otherwise, we reject the null hypothesis, and the result is significant beyond the alpha level. For example, if alpha is 0.05, our result is significant if its p-value is less than 0.05. Once we decide whether we want to measure accuracy at the 10%, 5%, or 1% level, we can compute the test statistic. Here we will use the Z score, which is a ratio comparing the obtained difference between the sample mean and the hypothesized population mean to the standard error of the mean. This is an example of a one sample test of a mean when the standard deviation sigma is known. In our male weights example, we're going to use the Z statistic because we know the hypothesized population mean and the population standard deviation. To compute the Z statistic, we simply insert values derived from our sample into the formula. If in one sample we found that the sample mean was 173, then the Z statistic would be 0.60. Think of this value on the x-axis under a standard normal curve. Let's say we instead found the sample mean to be 185. Putting these values into the Z-stat formula, we find the Z-stat is 3.0. This is much further out in the tail of the x-axis on a normal distribution. The final step is drawing conclusions. Once we've computed the Z value of our test statistic, we have to look at the corresponding probability values to find out if the sample mean is reasonably close to the population mean. A large value shows that the obtained mean difference is large and in the critical region. The difference is significant, which means we have to reject the null hypothesis that the weights have not changed over time. If the mean difference is relatively small, then the test statistic will have a low value. In this case, we conclude that the evidence from the sample is not sufficient, and the decision is to fail to reject the null hypothesis. The P value is the area under the normal curve in the tails beyond the Z-stat. It answers the question, what is the probability of the observed test statistic, or one more extreme, when H naught is true? To convert Z statistics to P values, we will use software. In one sample with the sample mean of 173, the Z statistic was 0.60, which corresponds to a one-sided P value of about 0.27. If we had this sample, we would fail to reject the null hypothesis that the mean weight has not changed over time.
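Here is a minimal sketch of these calculations. The sample size is never stated in the lesson; n = 64 is implied, since it makes the standard error 40 divided by the square root of 64, or 5, so that a sample mean of 173 yields a Z of 0.60:

```python
import math
from scipy.stats import norm

mu0, sigma, n = 170, 40, 64       # hypothesized mean, population sd, implied n
se = sigma / math.sqrt(n)         # standard error = 5

for xbar in (173, 185):
    z = (xbar - mu0) / se         # 0.60 and 3.0
    p_one_sided = 1 - norm.cdf(z) # area in the upper tail beyond z
    p_two_sided = 2 * p_one_sided # doubled for a two-tailed test
    print(xbar, round(z, 2), round(p_one_sided, 4), round(p_two_sided, 4))
```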
Likewise, if we computed the P value for Z equals 3.0, we would get 0.001, which means we have to reject the null hypothesis that the mean weight has remained the same over time. Note that when we're looking at weight change instead of weight increase, all we have to do is multiply the one-sided P value by two to do a two-tailed test. Since we will be using P values in all our subsequent analysis, it's worth emphasizing what they mean. P values ask the question, what is the probability of the observed test statistic, or one more extreme, when H naught is true? Remember, the smaller the P value, the stronger the evidence against your null hypothesis. This graphic depicts the significance of P values at less than 1%, between 1 and 5%, between 5 and 10%, and greater than 10%. These are common significance levels. 5% is the most common cutoff. However, note that it's unwise to draw firm borders for significance. As an example, a P value of 0.27 would not be significant against H naught. A P value of 0.01, on the other hand, would be highly significant against H naught. This concludes our video on hypothesis testing, also called significance testing, which occurs when we test a claim about a population parameter using evidence that confirms or rejects that claim. Today we covered the four steps in hypothesis testing: state the null and alternative hypotheses, locate the critical region, compute the test statistic, and draw conclusions.

This presentation will cover correlation, including a definition of correlation, a discussion of the need for correlation, details on computing correlation, including variance, covariance, and the correlation coefficient, strength of association, linear and curvilinear relationships, properties of correlation, R squared, the coefficient of determination, and a discussion of correlation versus causation. Correlation is one of the most common and useful statistics. It's a measure of association, a single number that describes the degree of relationship between two variables. We can examine correlations between two variables heuristically by looking at a scatter chart. In this chart, our observations are very tightly centered around the line. In this case, we would say that the relationship between X and Y is more correlated. We call this a strong correlation. By contrast, if the observations are scattered further out, we might say the relationship between X and Y is less correlated, or that there is a weak correlation. Here are some examples of questions that ask about correlation. Is there any association between hours of study and grades? Is there any association between the number of churches in a city and the murder rate? When the weather gets hot, what happens to sweater sales? What is the strength of association between them? What about the sale of ice cream versus temperature? What is the strength of association between them? Furthermore, how do we quantify the association? While we can guess at the relationship, there's a better way to do this using statistical measures. The measure we use is the Pearson correlation coefficient. To compute correlation, we'll need information on standard deviation and covariance. We know that the variance is the dispersion within a variable X or Y, or the average squared deviation from the mean, as shown here. The covariance measures how X and Y vary together. It is calculated as the average of the product of the deviations of X and Y from their individual means.
Using the information on variance and covariance, we can compute the correlation coefficient as the covariance of X and Y divided by the standard deviation of X multiplied by the standard deviation of Y. This measure of correlation ranges from negative one to positive one. The closer the number is to negative one or positive one, the stronger the correlation; the closer it is to zero, the weaker the correlation. The correlation coefficient R measures the strength of linear association. It measures the extent to which two variables are proportional to each other. It's unit free. So, for example, a measure of correlation between player height measured in inches and player weight measured in pounds will be meaningful even though they're measured in different units. Here are some examples: no linear association, negligible negative association, weak positive association, moderate negative association, very strong positive association, very strong negative association. In these scatter plots, what is happening to Y as X is increasing? An important point to remember is that correlation is a measure of linear association. If the relationship is curvilinear, using the correlation measure is not appropriate. If X changes and Y stays the same, then the correlation is zero. Since the correlation measure is a measure of linear association, we cannot use correlations on categorical data. Its statistical significance is related to sample size, and it's also very sensitive to outliers. The correlation measure R measures the strength of linear association. Squaring the correlation coefficient gives us R squared, which is the coefficient of determination. It is the proportion of common variation in two variables. This measures the strength or the magnitude of the relationship. While we cannot use a percentage to interpret R, we can do so for R squared. For example, if R squared equals 67%, then we can say that 67% of the variation in Y is shared with variation in X. Correlation does not imply causation. It's easy to see that in this chart, the Internet Explorer market share correlates with the murder rate in the US, but that doesn't mean that one caused the other. Causal relationships are determined based on facts and business models. We cannot determine causality from data. Correlation is a mathematical formula. You will get a number no matter what data you feed it. First, you need to establish a logical relation and then find the correlation. Variables may be correlated if they have a causal relationship. For example, water causes plants to grow. Correlation can also occur when one variable is both the cause and the effect. For example, coffee consumption can cause nervousness, but it's possible that nervous people also drink more coffee. Correlation can also be high because both variables move together due to a missing third variable. Consider, for example, this comparison of deaths due to drowning and soft drink consumption during summer. Both variables are related to heat and humidity, a third variable not shown here. Omitting such variables can be dangerous. Here's a look at some additional measures of correlation using scatter charts. This concludes our video on correlation. Today we discussed the definition of correlation, the need for correlation, details on computing correlation, including variance, covariance, and the correlation coefficient, strength of association, linear and curvilinear relationships, properties of correlation, R squared, the coefficient of determination, and correlation versus causation.
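Here is a minimal sketch of these calculations on made-up study-hours and grade data (numpy's corrcoef computes the same Pearson coefficient directly):

```python
import numpy as np

# Made-up hours-of-study and grade data.
hours  = np.array([2, 4, 5, 7, 8, 10, 11, 13])
grades = np.array([55, 60, 62, 70, 75, 82, 85, 93])

cov = np.cov(hours, grades, ddof=1)[0, 1]                   # covariance of X and Y
r = cov / (np.std(hours, ddof=1) * np.std(grades, ddof=1))  # Pearson's r
r_check = np.corrcoef(hours, grades)[0, 1]                  # same value, built in

print(r, r_check, r ** 2)  # r squared is the coefficient of determination
```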
This video will cover binomial distributions, which are a type of discrete distribution.

We will first compare discrete and continuous distributions of a single random variable, and then we'll look at the binomial distribution specifically. There are two types of random variables, discrete and continuous. A discrete random variable has only a finite number of possible values, whereas a continuous random variable has a continuum of possible values. Usually a discrete distribution results from a count, whereas a continuous distribution results from a measurement. The distinction between counts and measurements is not always clear cut. A probability distribution is simply a mapping of all distinct events for a variable and their probability of occurrence, such as the distribution of a coin flip experiment. The form of the distribution depends on whether the variables are discrete or continuous. Here are some examples of discrete variables: outcomes of dice rolls, whether a customer likes or dislikes a product, or the number of hits on a website. Some examples of continuous variables include the weekly change in the Dow Jones industrial average, daily temperature, or the time between machine failures. To specify the probability distribution of a random variable X, we need to specify all of its possible values and their probabilities. We assume that there are K possible values and write out our list of possible values like this. A typical value is denoted like this. And the probability of a typical value is denoted like this. Next, we will discuss distributions of both discrete and continuous variables. For each type of variable, distributions can be characterized by three measures: mean, variance, and standard deviation. These are the formulas for working with discrete distributions. While we won't be computing these measures by hand, you do need to be aware of the formulas. The mean, also called the expected value, is calculated with this formula. The mean is a weighted sum of all possible values, weighted by their probabilities. The mean is denoted by the Greek letter mu. The variance is a weighted sum of the squared deviations of the possible values from the mean, where the weights are, again, the probabilities. The standard deviation is simply the square root of the variance. Standard deviation is denoted by the Greek letter sigma. These are the formulas for working with continuous distributions. A probability distribution visually summarizes the probabilities associated with all possible events for a variable. We will focus on three probability distributions that are commonly used in explaining real-world events. Binomial distributions are used with discrete data, while exponential and normal distributions are used with continuous data. A binomial distribution is a discrete distribution that represents the number of successes in N independent trials, each of which has the probability of success P. Each trial has a binary outcome. For example, a coin toss yields either heads or tails. The probability of either observation, heads or tails, is the same each time we toss the coin. These outcomes are generally called success and failure. The probability of success is P and the probability of failure is one minus P. The distribution maps the outcomes of all the trials. Each trial has to be independent, and the probability of success has to be the same for each trial. This is the probability mass function formula for a binomial distribution. If we toss a coin 100 times, what is the probability that we will get 40 heads? What is the probability of getting 90 heads?
That probability can be computed by applying this formula. We have only two possible outcomes, one or zero, success or failure, in N independent trials. This formula depicts the probability of exactly X successes: N choose X, times P to the power X, times one minus P to the power N minus X. N is the number of trials. X is the number of successes out of N trials. P is the probability of success and one minus P is the probability of failure. All probability distributions are characterized by an expected value and variance. If we toss a coin 100 times, what would be the average number of heads we would get? What about the variance? These are computed using these formulas: the mean is N times P, and the variance is N times P times one minus P, so for 100 tosses of a fair coin we expect 50 heads on average. You'll often see the assumptions of the normal distribution being applied to discrete outcomes. This is because the binomial distribution approximates a normal distribution for large samples. So for large enough samples, we can calculate probabilities using normal probability rules. This concludes our video on binomial distributions. Today, we covered discrete and continuous distributions of a single random variable and the binomial distribution.

This video will cover populations and inferences, sampling error, and the central limit theorem. A population is the set of all members about which a study intends to make inferences. Here's a population of people. We'd like to study their television watching behavior to determine how many watch a particular show so we can decide whether to purchase advertising spots during this period of time. But our population is much too large for a feasible study. To study their television viewing habits, we will have to survey them, and it's not feasible to survey every single individual. So instead, we'll take a sample of the population to study. By choosing a representative sample, we can make some inferences about the population's behavior. But it's unlikely that one sample can provide accurate measures of behavior for the entire population. An estimate of the population parameter, or the proportion watching a television show, is likely to be different for different samples of the same size and is likely to be different from the population parameter. This is called sampling error. The sampling error is unknown, but we can estimate the extent of this error by applying the central limit theorem. The central limit theorem tells us that what we know about our sample can tell us about the larger population the sample came from. For any results that are generated from samples, we get a range of estimates of a population parameter, which includes the mean and standard deviation from each sample. In our example, it would be an estimate of the proportion watching a particular TV show. These estimates have their own distribution, and the central limit theorem tells us that the distribution looks like a bell curve. The central limit theorem makes predicting outcomes a lot easier. If the sample size is large enough, then the sampling distribution of the mean is approximately normally distributed regardless of the distribution of the population. If all possible random samples, each of size n, are taken from any population with mean mu and standard deviation sigma, the sampling distribution of the sample means will have a mean equal to mu, have a standard deviation equal to sigma divided by the square root of n, and be approximately normally distributed regardless of the shape of the parent population. Remember that normality improves with a larger n, and it all comes back to z. Note the symbols here. The mean of the sample means is denoted as mu of x bar.
The standard deviation of the sample means is written as sigma of x bar and is also called the standard error of the sample mean. That concludes our video. Today we covered populations and inferences, sampling error, and the central limit theorem.

In this video, we will first define probability. Then we will cover the rule of complements, the addition rule, probabilistic independence, conditional probability, and the Bayes theorem. A probability is a number between zero and one that measures the likelihood that some event will occur for a random variable. An event with probability zero cannot occur, whereas an event with probability one is certain to occur. An event with probability greater than zero and less than one involves uncertainty. Here are some examples: the odds of winning a lottery, the likelihood of a particular candidate winning an election, or the chance of rolling a four on a fair die. In the case of the die, there are six sides, so the probability of rolling a four is one out of six. The rule of complements simply gives the probability of an event not occurring. If A is any event, the probability of A is P of A. The complement of A is the event that A does not occur. The probability of the complement of A is shown by this equation: one minus the probability of the event occurring. In our dice example, the probability of getting a four was one in six, so the probability of not getting a four is one minus one in six, which equals five in six. The addition rule of probability involves the probability that at least one of several events will occur. Events are exhaustive if they exhaust all possibilities; one of the events must occur. For example, when we roll a six-sided die, we will always end up with a number between one and six. We say that events are mutually exclusive if at most one of them can occur. For example, you can't roll a three and a six on one die at the same time. If you have two mutually exclusive events, like our three and six, then the probability of either one occurring is the sum of the two separate probabilities. If two events are independent, or their outcomes aren't affected by each other, then the probability of both A and B occurring is simply the product of the two probabilities. In the case of our die, the probability of getting a six on the first roll and a three on the second roll is one in six times one in six, which equals a one in 36 chance. Sometimes the probability of one event will affect another. These are called dependent events, and their probabilities are called conditional. This is the formula for conditional probability: the conditional probability of A, given that B has already occurred, is equal to the joint probability of both events occurring together, divided by the probability of B occurring without regard to whether A has occurred or not. The Bayes theorem allows us to estimate posterior probabilities once we obtain new data. With it, we can measure the likelihood of event H occurring once we obtain particular pieces of evidence from data D. The parts of the theorem include the independent probability of H, or prior probability, the independent probability of D, the conditional probability of D given H, or likelihood, and the conditional probability of H given D, or posterior probability. The theorem states that the posterior probability of H given D equals the likelihood of D given H, times the prior probability of H, divided by the probability of D. This concludes our video on basic probability. Today, we defined probability and covered the rule of complements, the addition rule, probabilistic independence, conditional probability, and the Bayes theorem.
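Here is a minimal sketch of these rules, ending with the Bayes theorem applied to made-up prior and likelihood numbers:

```python
# Rule of complements and multiplication for independent events.
p_four = 1 / 6
p_not_four = 1 - p_four                     # rule of complements: 5/6
p_six_then_three = (1 / 6) * (1 / 6)        # independent events: 1/36

# Bayes theorem: P(H|D) = P(D|H) * P(H) / P(D), with made-up numbers.
p_h = 0.01                                  # prior probability of hypothesis H
p_d_given_h = 0.95                          # likelihood of the evidence given H
p_d = p_d_given_h * p_h + 0.05 * (1 - p_h)  # total probability of the evidence
p_h_given_d = p_d_given_h * p_h / p_d       # posterior, about 0.16

print(p_not_four, p_six_then_three, p_h_given_d)
```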
This concludes our video on basic probability. Today, we defined probability and covered the rule of complements, the addition rule, probabilistic independence, conditional probability, and Bayes' theorem. In this video, we will cover variable roles, including explanatory and outcome variables, and variable classification, including qualitative variables, nominal, ordinal, and binary, and quantitative variables, discrete, continuous, interval, and ratio. Any analytics project begins with a question: what is the problem we are trying to solve? To address that question, we need to collect data. The next step in the process is to understand the data collected, and only then can we move to the further steps of data cleaning, data analysis, and solving the problem. A key step in understanding the information collected is to identify all the variables in the data set. We need to know what variable types we have in order to make them amenable to further analysis. Variables have two possible roles. The first is explanatory. Explanatory variables are also called features or independent variables. These are variables that are used as inputs to explain the variation in the outcome variable. The second role a variable can take on is outcome. An outcome variable is also known as a target or dependent variable. These are variables that measure the output or impact that's being studied. Most studies have many independent variables and one dependent variable. For example, a person's weight could be a function of age, gender, and calories consumed. Fuel efficiency is a function of features such as car size, weight, and number of cylinders. Restaurant ratings are a function of food quality, ambiance, and service. Variables can be qualitative or quantitative. Qualitative data can be nominal, ordinal, or binary. Quantitative data can be discrete or continuous, with either an interval or ratio level of measurement. We'll start by discussing how to determine whether a variable is qualitative or quantitative. The best way to decide is to use the subtraction test: if two experimental units, such as people, have different values for a particular measure, subtract the two values and ask yourself about the meaning of the difference. For example, when hair color is coded as one equals blonde, two equals red, three equals brown, and four equals black, the difference between the values has no meaning, so the variable fails the subtraction test, which means hair color is a categorical, or qualitative, variable. However, if the difference is meaningful, then it is a quantitative variable. Take, for example, age in years: the differences between these numbers have meaning, so the variable is quantitative. We will now discuss qualitative variables in detail. Categorical variables are those that have only a few possible values, thus assigning each value to a particular group or category. For example, oceans are a categorical variable. Nominal and ordinal variables are often called labels. A nominal variable has levels with arbitrary names, for example, car colors. Ordinal variables have a logical order, for example, exam grades. A dichotomous, or binary, variable is a categorical variable that has only two levels or categories, often the answer to a yes or no question. But a variable doesn't have to be a yes-no variable to be binary; it just has to have only two categories, such as gender. We will now discuss quantitative variables in detail. Quantitative variables are those for which the recorded numbers encode magnitude information based on a true quantitative scale. They can be discrete or continuous. A discrete variable has only whole-number counts, while a continuous variable can take on any value on the number scale. To determine whether a variable is discrete or continuous, use the midway test: if, for every pair of values of a quantitative variable, the value midway between them is a meaningful value, then the variable is continuous. Otherwise, it's discrete. For example, age is continuous because the value midway between ages 20 and 30, namely 25, is meaningful. An example of a discrete variable is the number of children in a family: you can see here that 2.5 children does not make sense. The interval level of measurement ranks data and can be either discrete or continuous. With interval variables, precise differences between units of measure exist, but there's no meaningful zero. For example, take IQ scores. It makes sense to talk about someone having an IQ 50 points higher than another person, but an IQ of zero has no meaning. Ratio variables are interval variables with the added condition that a value of zero indicates that there is none of that variable. True ratios exist when the same variable is measured on two different members of the population. For example, consider the weight of an individual. It makes sense to say that a 150-pound adult weighs twice as much as a 75-pound child. However, it doesn't make sense to say that 70 degrees Fahrenheit is twice as hot as 35 degrees Fahrenheit, so temperature is not a ratio variable.
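As a rough illustration, here's a small pandas sketch, not from the video and with invented column names and values, that labels each column's classification and separates the explanatory variables from the outcome variable:

```python
# Hypothetical sketch of variable roles and classification.
import pandas as pd

df = pd.DataFrame({
    "hair_color": ["blonde", "red", "brown"],  # qualitative, nominal
    "exam_grade": ["A", "B", "A"],             # qualitative, ordinal
    "smoker":     [0, 1, 0],                   # qualitative, binary
    "children":   [2, 0, 3],                   # quantitative, discrete
    "age":        [34, 51, 29],                # quantitative, continuous, ratio
    "weight":     [150.0, 180.5, 142.3],       # quantitative outcome
})

# Explanatory variables (features, or independent variables)...
X = df.drop(columns=["weight"])
# ...and the outcome variable (target, or dependent variable).
y = df["weight"]
print(X.dtypes)
print(y.mean())
```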
This concludes our video on variables. In this video, we covered variable roles, including explanatory and outcome variables, and we also covered variable classification, including qualitative variables, nominal, ordinal, and binary, and quantitative variables, discrete, continuous, interval, and ratio. This video will cover basic information about coding, coding systems, and types of variables in coding, including binary, ordinal, nominal, and continuous. Coding is the process of translating the information gathered from questionnaires and other investigations into something that can be analyzed, usually using a computer program. Coding involves assigning a value to the information given in a questionnaire, and often that value is given a label. Coding can make the data more consistent. For example, if you ask the question "What is your gender?", you might end up with the answers male, female, M, F, et cetera. Coding will avoid such inconsistencies. A common coding system for binary variables is the following: zero equals no and one equals yes, where the number is the value assigned and the yes or no is the label of that value. Some like to use a system of ones and twos, where one equals no and two equals yes. This brings out an important point in coding: when you assign a value to a piece of information, you must also make it clear what the value means. In the first example, one equals yes, but in the second example, one equals no. Either way is fine as long as it's clear how the data are coded. You can make it clear by creating a data dictionary as a separate file to accompany the dataset. A binary variable is any variable that is coded to have two levels, like this example. In SAS, data representing gender coded as M or F would be converted into a binary variable. Here's an example. If we're asking about the number of years of education a person has, with a value of one for each year of education, that would mean anyone with more than 12 years of education has been to college, and anyone with less than 12 years of education has not been to college.
We can recode this into a binary yes-no variable by saying that if education is greater than 12, then college equals one; otherwise, college equals zero. This type of coding is useful in descriptive and predictive analytics. The coding process is similar with other categorical variables. For the variable education, we might code as follows: zero equals did not graduate from high school, one equals high school graduate, two equals some college or post-high school education, and three equals college graduate. Note that for this ordinal categorical variable, we need to be consistent with the numbering, because the value of the code assigned has significance: the higher the code, the more educated the respondent. In SAS, we would convert years of education to education categories like this. Here's an example of what not to do: zero equals some college or post-high school education, one equals high school graduate, two equals college graduate, and three equals did not graduate from high school. Can you tell what's wrong with this example? The data we're trying to code has an inherent order, but the coding in this example does not follow that order. Here's the correct way to do it. For nominal categorical variables, however, the order makes no difference. Here's an example. For the variable reside, one equals Northeast, two equals South, three equals Northwest, four equals Midwest, and five equals Southwest. It doesn't matter what order we use for these categories; Midwest can be coded as four, two, or five, because there's no ordered value associated with each response. Continuous variables are usually left in the same format as they are in the original data set; however, be careful about missing values and miscoded data. You may also need to code responses from fill-in-the-blank and open-ended questions. With an open-ended question such as "Why did you choose not to see a doctor about this illness?", respondents will all answer differently. Also, you may give response choices for a particular question but offer an "other, please specify" option as well, where respondents can write whatever response they choose. These types of open-ended questions can be a lot of work to analyze. One way to analyze the information is to group together responses with similar themes. For the question "Why did you choose not to see a doctor about this illness?", responses such as "didn't feel sick enough to see a doctor," "symptoms stopped," and "the illness didn't last very long" could all be grouped together as "the illness was not severe." You will also need to code "don't know" responses; typically, "don't know" is coded as nine.
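The video shows these recodes in SAS; as a rough stand-in, here is a Python/pandas sketch of the same idea. The column names and the year cutoffs for the education categories are assumptions made for illustration.

```python
# Hypothetical Python/pandas version of the recodes described above
# (the video's own examples are in SAS; names and cutoffs are assumed).
import pandas as pd

df = pd.DataFrame({"educ": [10, 12, 14, 16, 18]})  # years of education

# Binary recode: college = 1 if more than 12 years of education, else 0.
df["college"] = (df["educ"] > 12).astype(int)

# Ordinal recode: the codes must follow the inherent order of the categories.
def educ_category(years):
    if years < 12:
        return 0  # did not graduate from high school
    elif years == 12:
        return 1  # high school graduate
    elif years < 16:
        return 2  # some college or post-high school education
    else:
        return 3  # college graduate

df["educ_cat"] = df["educ"].apply(educ_category)
print(df)
```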
That concludes our video on coding with variables. Today, we covered some basic information about coding, coding systems, and types of variables in coding, including binary, ordinal, nominal, and continuous. This video provides a quick review of the measures of central tendency, including mean, median, and mode; variation and dispersion, including range, quartiles and interquartile range, sample variance, and standard deviation; and the normal curve, as well as the 68-95-99.7 rule. Before we begin, here's a quick review of symbols we'll use in this video. We will also use the abbreviation IQR for interquartile range. The mean is the average, or balancing point. To find the mean, divide the sum of all the values by the sample size. Here's a simple example of calculating the mean age of several participants in a study. The sigma is the sum, and the x-bar is the sample mean. After adding the values together and dividing by the number of values, eight, we arrive at our mean, 23.25. We can also construct means of binary variables; the mean of a binary variable represents the proportion of ones. The mean is affected by extreme values, which is why we often look at means in conjunction with medians to understand how the data are distributed. In this example, the mean of the values one, two, three, four, and five is calculated by adding the values together to make 15, then dividing by five, the number of values. The mean of this group is three. However, if the values are one, two, three, four, and 10, the mean shifts to four. The median is the middle value of the data. In this example, we have seven different ages. To find the median, we first order them from smallest to largest and then locate the value in the center. However, if we have an even number of observations, the median is computed as the average of the two middle values. The median is not impacted by outliers. Here, our median of the five values is three. If we replace the value five with 10, our median is still three. The mode is the value that occurs most frequently. It is only useful when we have some values clustering together. In this example, the mode is nine. There may be no mode, or there may be several modes. There is no single measure of center that is best. If the data are normally distributed, then the mean is used. However, if the data are not normally distributed, the median is a better measure. Often, we use both to understand the underlying structure of the distribution. There are several measures to examine the spread of the data. They include range, percentiles, interquartile range, and variance or standard deviation. The range is the difference between the largest and the smallest value. This histogram shows a minimum value of 15 and a maximum value of 94; the range is 94 minus 15, which equals 79. Another measure of spread is the value of each quartile. We take the total number of data points we have and divide them into four equal parts; the value corresponding to the endpoint of each part is the quartile value. The interquartile range is the difference between the value at the third quartile and the value at the first quartile. The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger. The second quartile, Q2, is the same as the median: 50% are smaller and 50% are larger. Only 25% of the observations are greater than the third quartile, Q3. Let's take the age values 15, 35, 49, 65, and 94. The first quartile is at 35, which means that 25% of the participants are below age 35. Likewise, 25% are above 65 years old. The interquartile range is 65 minus 35, which equals 30 years. Sample variance is calculated from the squared deviations of values from the mean, summed and divided by the sample size minus one, as shown here. We square the differences from the mean to give equal weight to observations below the mean and those above the mean. Because we square the differences, values that are further from the mean get higher weight than those close to the mean. Standard deviation is the most commonly used measure of variation. It shows the variation around the mean and has the same units as the original data. It is calculated by taking the square root of the variance. Here's an example of the standard deviation using age data. Note that the sample standard deviation is represented by the symbol s, and x-bar represents the sample mean.
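As a quick check on these measures, here's a minimal sketch, not from the video, using Python's statistics module with the same five ages from the quartile example:

```python
# Minimal sketch of the summary measures above (ages from the quartile example).
import statistics

ages = [15, 35, 49, 65, 94]

print(statistics.mean(ages))       # 51.6: the balancing point
print(statistics.median(ages))     # 49: the middle value
print(statistics.mode([9, 9, 7]))  # 9: the most frequent value

print(max(ages) - min(ages))       # range: 94 - 15 = 79

# Sample variance divides the summed squared deviations by n - 1;
# the standard deviation is its square root, in the original units.
print(statistics.variance(ages))   # 899.8
print(statistics.stdev(ages))      # about 30.0

# Quartiles and the interquartile range. Software interpolates between points,
# so Q1 and Q3 can differ slightly from the simple hand method in the video.
q1, q2, q3 = statistics.quantiles(ages, n=4)
print(q3 - q1)                     # IQR
```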
The standard deviation is an extremely useful measure. It tells us how close or far apart data are from the mean: the higher the standard deviation, the greater the spread of the data. Here in red is an example of a moderate standard deviation; you can see that the data are spread fairly evenly. The purple shows a low standard deviation, in which the data are concentrated near the middle. The blue example shows a high standard deviation, where the data are spread far out into the tails. These formulas are important to know well. While software can compute them for you, it's important to know how they're computed using simple numbers. Whenever you work with data, you'll have variables that have a center and a spread. A very useful rule to know, called Chebyshev's rule, is that no matter what the shape of the distribution, at least 75% of values will lie within two standard deviations of the mean, while at least 89% will lie within three standard deviations of the mean. So if someone gives you just these two pieces of information, you can make some predictions about where a new data point will lie. However, what's even better in statistics is knowing that many variables are distributed symmetrically and follow the bell curve, where we can say much more. The 68-95-99.7 rule states that 68% of the area of a normal curve lies within one standard deviation of the mean, 95% of the area lies within two standard deviations of the mean, and 99.7% of the area lies within three standard deviations of the mean. This rule works for all normal curves, no matter their mean or standard deviation. That concludes our video on measures of central tendency, including mean, median, and mode; variation and dispersion, including range, quartiles and interquartile range, sample variance, and standard deviation; and the normal curve and the 68-95-99.7 rule.
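As a closing illustration, here's a quick simulation sketch, not from the video, that checks the 68-95-99.7 rule on normally distributed data; the mean and standard deviation chosen are arbitrary.

```python
# Verify the 68-95-99.7 rule by simulation (mean and SD are arbitrary).
import random
import statistics

random.seed(42)
data = [random.gauss(mu=100, sigma=15) for _ in range(100_000)]

m = statistics.mean(data)
s = statistics.stdev(data)

for k in (1, 2, 3):
    share = sum(1 for x in data if abs(x - m) <= k * s) / len(data)
    print(f"within {k} SD: {share:.3f}")  # about 0.683, 0.954, 0.997
```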