Hello and welcome to this session, in which we'll discuss the AICPA data criteria. In 2020, the AICPA outlined three criteria for defining, documenting, and evaluating a dataset. It was guidance issued by its Assurance Services Executive Committee, and the title of the report was Criteria for Describing a Set of Data and Evaluating Its Integrity. In my opinion, if the AICPA has given you certain criteria about data, and data has been examined on the CPA exam, I strongly suggest we go over this in detail just to make sure we cover all our bases. So they establish three criteria. Criterion one: the description states the purpose of the dataset. Simply put, you've got to tell us the purpose of the dataset in the description. Criterion two: the description of the set of data is complete and accurate, and we're going to see what that involves. Criterion three: the description highlights any missing information. So what the AICPA is saying is that for data to be good data, it has to meet those criteria. Before we proceed any further, I have a public announcement about my company, farhatlectures.com. Farhat Accounting Lectures is a supplemental educational tool that's going to help you with your CPA exam preparation, as well as your accounting courses. My CPA material is aligned with your CPA review course, such as Becker, Roger, Wiley, Gleim, Miles. My accounting courses are aligned with your accounting courses, broken down by chapter and topic. My resources consist of lectures, multiple choice questions, true-false questions, as well as exercises. Go ahead, start your free trial today, no obligation, no credit card required. Starting with criterion one, it emphasizes the importance of including the purpose of the dataset in the description. Simply put, the description means the metadata. Metadata is the data about the data.
If you don't know what metadata is, please look at the prior recording. Metadata means: tell me about the data. Well, the first thing you want to tell me is the purpose of this data, because that's going to allow me to assess its relevance to whatever I'm doing. Data could be collected for a single specific purpose, or data could be collected for multiple purposes, and I need to know which. For example, single-purpose data could be sales collected by a retail store for the sole purpose of tracking daily sales and analyzing the performance of different products. That's a single specific purpose; that's why you collected the data. Or you could have multi-purpose data. For example, credit card transaction data could be used by the credit card company to monitor account activity and detect fraudulent transactions. It could also be used by credit bureaus to determine creditworthiness. It could also be used by merchants to analyze spending patterns and improve their sales strategies to promote what they need to promote. It could also be used by financial institutions to offer personalized loans and investment advice to customers. So notice it's the same data, but it's used for different purposes. I need to know why the data was collected, because that tells me its original purpose, and you might collect data for one purpose and use it for something else. Was it collected for this specific purpose? You need to know this. Now we're going to look at criterion two, and criterion two has many elements. Criterion two basically states that we need a complete and accurate description of the data set, including the following elements.
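To make criterion one concrete, here is a minimal sketch in Python of a metadata record that documents a dataset's purpose, plus a check that flags when the purpose is missing. All names and field labels here are hypothetical illustrations, not taken from the AICPA guidance.

```python
# Hypothetical sketch: a metadata record that documents the purpose
# of a dataset, as criterion one requires. Field names are illustrative.
dataset_metadata = {
    "name": "credit_card_transactions_2023",
    "purpose": (
        "Collected by the card issuer to monitor account activity "
        "and detect fraudulent transactions."
    ),
    "secondary_uses": [
        "credit bureaus: assess creditworthiness",
        "merchants: analyze spending patterns",
        "financial institutions: personalize loan and investment offers",
    ],
}

def describe_purpose(metadata: dict) -> str:
    """Return the stated purpose, or flag that it is missing."""
    return metadata.get("purpose", "PURPOSE NOT DOCUMENTED")

print(describe_purpose(dataset_metadata))
```

A reviewer reading only the metadata can immediately judge whether the data is relevant to their need, which is exactly what the criterion is after.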
So I'm going to list those elements, and some of the elements I combined for simplicity: precision, units, population and sample, fields and records, sources, time, uncertainty, and filters. Now I'm going to go over each of these descriptions separately to meet criterion number two, which means the data description should be complete and accurate. Starting with precision. What is precision? It's how accurate you are, and how accurate you need to be all depends: the required level of accuracy in the data may differ depending on the purpose and the goal of the analysis. What does that mean? For example, if you are performing analytical procedures, a thousand-dollar range might be sufficient. Depending on how large the company is, if you are computing the current ratio, or the change in sales from year to year, and we're dealing with millions of dollars, a thousand dollars of accuracy is not an issue. However, if you are preparing a bank reconciliation, now we require a more precise measurement; we have to go down to the penny. So the accuracy and precision should be aligned with the specific need of the analysis. In certain data we have to be precise, and in others not; the precision depends on the need. For example, you know when we have those political surveys, they would say there's a margin of error of plus or minus three. That's acceptable for a survey of public opinion. However, if we are conducting a scientific experiment using data, we may require measurement to the third or fourth decimal, so we have to be extremely accurate and specific. So precision depends on the data and the use of that data, but we need to know what precision we are using. Also, we need to know about the units, because we're using numbers.
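The precision point can be sketched in a couple of lines of Python: the same balance rounded two different ways, matching the analytical-procedures versus bank-reconciliation contrast above. The dollar figure is made up for illustration.

```python
# Hypothetical balance: the same number viewed at two levels of precision.
balance = 4_832_157.4389

# Analytical procedures: the nearest thousand dollars is often close enough.
analytical_view = round(balance, -3)

# Bank reconciliation: must be precise down to the penny.
reconciliation_view = round(balance, 2)

print(analytical_view)       # rounded to the nearest thousand
print(reconciliation_view)   # rounded to the cent
```

Documenting which of these precisions a field carries is part of the dataset description, so a user never mistakes a thousand-dollar approximation for a penny-exact figure.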
The measurement units of the data elements are important because we have to specify what we are using; without knowing the unit, we don't know what we're dealing with. Are we dealing with US dollars? Are we dealing with Japanese yen? Are we dealing with euros? Because it does make a difference. A currency can be expressed in many forms: dollars, pounds, euros. Even within a currency, for example, are these numbers in thousands, hundreds of thousands, millions, and so on? So it's crucial to state the measurement unit for each field (and we're going to see what a field is) to ensure accurate interpretation of the data. So we need to know: I'm looking at numbers, but what do they really represent? And really, this is very similar to the monetary unit assumption that we discuss when we study basic or intermediate accounting: numbers have to be expressed in a monetary unit, in a specific currency. Also, we need to discuss population and sample. What is a population and what is a sample? The population is the entire group of units that we aim to describe or predict. That's the whole population, everything. A sample, on the other hand, is a small part of this group. When you sample, you don't test everything. Now, with the advancement of big data, it's very common to have samples that include the entire population. Because of technology, we don't have to sample; we can look at the whole data. But when we treat the whole population as a sample, we have to be careful: even in big samples, some members of the population might be missing. Why could they be missing? Let's assume we are collecting sales. Maybe some sales were not recorded, or data is missing due to technical difficulties.
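When the "sample" claims to be the whole population, a simple completeness check can reveal missing records. Here is a minimal sketch, assuming (hypothetically) that sales are identified by a sequential invoice number, so a gap in the sequence signals an unrecorded sale or a technical failure.

```python
def find_missing_invoices(invoice_numbers):
    """Return invoice numbers absent from the expected continuous range."""
    expected = set(range(min(invoice_numbers), max(invoice_numbers) + 1))
    return sorted(expected - set(invoice_numbers))

# Hypothetical day's sales: invoices 1003 and 1006 never made it in.
recorded = [1001, 1002, 1004, 1005, 1007]
print(find_missing_invoices(recorded))  # -> [1003, 1006]
```

A check like this is one way the auditor can test whether a "whole population" data set really does include everything before relying on it.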
So we have to be aware of this. The auditor must determine if the sample still meets the audit objective, because when we say we are treating the whole population as a sample, we want to make sure the whole population indeed includes everything and nothing is missing. Also, we need to know the difference between a field, a record, and a table. Usually everyone starts with fields, goes from fields to records, and then to tables; I'm going to go the opposite way. So this is a table, and we're going to break it down. This is a table for customers. It includes the first name of the customer, last name, street number, street name, city, state, zip code, phone, tax ID, date of birth, and whether this customer is considered local for us or non-local. So this is the table, including everything. Now within this table, we have records. For example, we have a record for Adam; we have three records, a record for John and a record for Mark. What is a record? A record is a group of related fields or attributes, and these are all attributes: first name, last name, address, phone number, tax ID. A collection of all of those creates a record. So let me highlight the record in yellow, this way you would see it. This is one record in that table, and it happens to be for a customer. Now, a record describes a single instance of an entity. Sometimes we use the word table, sometimes the word entity, but if we're dealing with a database, we're dealing with a table; it's basically the same thing. Now within this record, we have the actual data values: the first name, the last name, the number of the street, the name of the street, the city, the state, and so forth.
A data value is a specific value found within a field, which can consist of a single character, such as yes or no, or multiple characters, such as a date, text, or numbers. Now these are the values. They could be text; for example, under the zip code, the field can only accept numbers. Under the phone, it can only accept numbers in this format. Under the tax ID, it's two digits, then a dash, then five more digits. So it's programmed that way. The fields are specific: date of birth has to be presented this way, a two-digit month, a two-digit day, and a four-digit year. Okay, so those are the data values in the specific fields. Now what is a field? A field in a data structure refers to a group of characters: 555 is a group of characters, Farhat is a group of characters, Main Street is a group of characters, each representing a characteristic of an entity. So this is a field, this is a field, this is a field, and each field could hold numeric or text values, or yes or no. If you combine all the fields, that gives you a record; if you combine all the records, you get a table. So the data has to be organized that way to be good data. Sources of data: that's important, because when you're performing data analytics, your conclusion is only as good as your data, and your data is only as good as its sources. So it's crucial to identify the source of the data in order to evaluate its reliability and credibility, because if the data is no good, it doesn't matter how good your analysis is. The credibility of data from a client management that's deemed trustworthy will be higher than the credibility of data from a client management that's considered unreliable. So I need to look at where the data is coming from.
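The field-to-record-to-table hierarchy just described can be sketched in a few lines of Python. The customer attributes mirror the example table above; the specific names and values are hypothetical.

```python
from dataclasses import dataclass

# Sketch of the table -> record -> field hierarchy.
# Each attribute of Customer is a field; one Customer object is a record;
# the list of customers is the table.
@dataclass
class Customer:
    first_name: str   # field: text
    last_name: str    # field: text
    zip_code: str     # field: digits only, stored as text
    is_local: bool    # field: a yes/no flag

table = [
    Customer("Adam", "Smith", "19104", True),   # one record
    Customer("John", "Doe",   "08816", False),  # another record
    Customer("Mark", "Jones", "19010", True),   # a third record
]

# A data value is the content of one field in one record:
print(table[0].first_name)  # -> Adam
```

Notice how the type annotations play the same role as the field formats in the table: they constrain what kind of data value each field may hold.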
So the description of the data source should also be clear and specific, to enable reproduction. If I want to go back and do exactly what you did, if I want to go back to the source, I should be able to do so. A user should be able to recreate the data by using the description of the data source and accessing the source. Now, bear in mind, sometimes we get the data, then we clean it and extract it (we'll talk about this later on), and that's different; what I'm saying is the original data should be the same. Also, we need to know about data uncertainty or variability. What is variability? The degree of variability and uncertainty present in data elements and their population can be indicated through various measures. The best way to explain variability or uncertainty in data is with an example; I always use class averages. Let's assume I have two classes, class A and class B, and both have an average of 75. In class A, I could have the following set of grades: 90, 20, 95, 75, 95. In other words, some really high grades and some really low grades. Then in class B, the grades look something like this: 74, 76, 75, 78, 72. So notice, here I also have an average of 75, but the grades are all clustered around each other, clustered around the average. They both have an average of 75, but A is more variable; the grades vary. That's the degree of variability. Now, how would you compute the degree of variability? You can compute the standard deviation. All you need to know is that we need to know the degree of variability in the data. Sometimes it shows up as the margin of error in polling, which we mentioned: for example, a margin of error of plus or minus 3% at a 95% confidence level indicates that if the survey were conducted 100 times, the result would be within 3% of the true population value 95% of the time.
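The two-classes example can be verified directly with Python's standard library. The grades below are illustrative, adjusted slightly so both classes average exactly 75; the population standard deviation then quantifies how much more variable class A is than class B.

```python
from statistics import mean, pstdev

# Two classes with the same average but very different variability.
class_a = [90, 20, 95, 75, 95]   # spread out: high and low grades
class_b = [74, 76, 75, 78, 72]   # clustered tightly around the average

print(mean(class_a), mean(class_b))      # both average 75
print(pstdev(class_a), pstdev(class_b))  # A's spread is far larger than B's
```

The identical means hide completely different distributions, which is exactly why the AICPA wants variability disclosed alongside the data, not just a summary average.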
So you're not sure, but you have some variability, and you want to measure that variability. Time is important. The temporal aspect of data is crucial for its interpretation and application. When did you get that data? What period does it cover? Knowing the time frame of the data enables the user to determine if it's complete or if additional data is needed. For example, if you're analyzing crime data, it's important to know the time period. Is it during the night? Is it during the day? Such as the number of reported crimes in a specific year or quarter: in which quarter is this happening, mostly in the spring, the summer, or the winter? That's relevant data. When did it happen? This information can give insight into trends and patterns in crime over time and help inform decisions, for example, by law enforcement about crime prevention and enforcement strategies. If most of the crime is reported from 8pm till 3am, then we know we need more police officers or more law enforcement power during that time. Or, for example, if a data set of website visits is missing 25 days, or is missing the 2pm-to-6pm data, then it's important to determine how to obtain the missing data, because that time is important. So we want to make sure we have the proper time frame: nothing can be missing, and the time is there. Next, filters. The filtering criteria used to determine the inclusion or exclusion of data elements in the data set must be specified when the data set is filtered. So if we do filter the data, we have to let the users know. This helps them understand the limitations of the data and ensures the transparency of the analysis process. So we have to let them know if we filtered anything out.
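One way to keep filtering transparent is to store the filter criteria as data in their own right, next to the filtered result. Here is a minimal sketch (the car-sales records and criteria names are hypothetical) that filters a data set down to sedans sold on or after a cutoff date, while the criteria themselves stay documented.

```python
from datetime import date

# Hypothetical raw data set of car sales.
sales = [
    {"model": "sedan", "sold": date(2023, 6, 1)},
    {"model": "suv",   "sold": date(2023, 7, 15)},
    {"model": "sedan", "sold": date(2021, 3, 9)},
]

# The filter criteria are recorded explicitly, so users of the filtered
# data can see exactly what was included and excluded.
filter_criteria = {
    "body_style": "sedan",
    "sold_on_or_after": date(2023, 1, 1),
}

filtered = [
    s for s in sales
    if s["model"] == filter_criteria["body_style"]
    and s["sold"] >= filter_criteria["sold_on_or_after"]
]
print(len(filtered))  # only one sale survives both filters
```

Publishing `filter_criteria` alongside `filtered` is the programmatic equivalent of the disclosure the criterion requires: anyone can see that SUVs and older sales were deliberately excluded, not missing.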
For example, if a data set of car sales includes only sedans that were sold in the past year, this should be clearly indicated to show what's been filtered: we're talking about only sedans. The criteria for including only sedans and excluding others, as well as the time period of the past year, should be specified. You have to be very specific. And criterion three says that the data description highlights any missing information necessary for a complete understanding of each data element in the population. So if something is missing (sometimes they call it a gap, or a limitation in the data), we have to talk about it. Highlight what information is absent, so the users are aware of any limitations or gaps in the data. If something is not there, you have to tell the users: I used this data, but be aware, I did not have this information. This helps users understand the context and limitations of the data and make informed decisions. Maybe because of that limitation, the whole study is no good, because I'm not interested in this unless I have data about this particular topic. So for example, a data set that includes information on real estate sales transactions may have a description highlighting the fact that it only includes sales transactions that occurred within a certain period and geographical area, and that other sales transactions may exist outside that scope. I can give you a realistic example. I'm selling my home now (I'm selling and buying), and I live in the greater Philadelphia area. I have a real estate agent, so he sends me a report on a weekly or monthly basis. And here's what's happening: if he gives me the average supply for the whole area, it looks something like this, flat. The supply is not up, not down.
But once he excludes the city of Philadelphia (supply means people are selling their homes), the supply looks something like this: there is not enough supply. In other words, if you look only at the city of Philadelphia, supply is up. However, once you include Philadelphia along with the surrounding area, it flattens. So it's important: don't just give me the greater Philadelphia area, because then it looks like, on average, there's enough supply. Really, there's not: if you exclude the city of Philadelphia and look only at the suburbs, there's not enough supply. So it's very important to understand if information is missing from the picture. The missing information is necessary to understand the limitations of the data elements in the population represented by the data set. So those are basically the three criteria that the AICPA believes describe a good data set. And because the AICPA is saying so, I strongly believe it's important for the CPA exam. What I suggest you do is complete a few multiple choice questions and be familiar with these topics, because who knows, you might see something like this on the CPA exam, whether we're talking about the current CPA exam or the new 2024 CPA exam, and you're going to be more familiar with data. Good luck, everyone, study hard, and of course, stay safe.