Hello everyone, today we are going to discuss the topic of how to prepare data for machine learning, part 2. Let us see the learning outcome of this topic. At the end of this session, students will be able to demonstrate how to prepare data to build an effective machine learning model.

So when a data analyst or data scientist decides to build an effective machine learning model, he or she has to work through several parameters. These parameters are defined in various ways and shape the outcome of the machine learning model.

The first parameter is to articulate the problem early. Knowing what you want to predict will help you decide which data may be more valuable to collect. When formulating the problem, conduct data exploration and try to think in the categories of classification, clustering, regression and ranking that we already talked about in the previous videos. These categories are very useful for business applications of machine learning, so it is important for data analysts and data scientists to keep them in mind while building an effective machine learning model.

The second parameter is classification. If you want an algorithm to answer binary yes-or-no questions, for example good or bad, sheep or goats, cats or dogs, or if you want to make a multi-class classification, such as grass, trees, birds and so on, then we use classification. Here you also need the right answers labeled, so an algorithm can learn from them.

The third parameter is clustering. If you want an algorithm to find the rules of classification and the number of classes on its own, we use a clustering algorithm. The main difference from a classification task is that you do not actually know what the groups and the principles of the division are. For instance, this usually happens when you need to segment your customers and apply a specific approach to each segment depending on its qualities, because clustering mainly focuses on the qualities of the data itself.

The fourth parameter is regression. If you want an algorithm to yield some numeric value, we use regression. For example, if you spend too much time coming up with the right price for your product, since it depends on many factors, a regression algorithm can aid in estimating this value.

The fifth parameter is ranking. Some machine learning algorithms just rank objects by a number of features. Ranking is actively used to recommend movies in video streaming services, or to show the products that a customer might purchase with high probability based on his or her previous search and purchase activities.

So these five parameters are very important to build an effective machine learning model, and a small sketch of the four task types follows below.
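To make the four task types concrete, here is a minimal sketch on synthetic data. It assumes scikit-learn and NumPy, which the lecture does not prescribe, and it approximates ranking by simply sorting items by a predicted score.

```python
# A minimal sketch of the four task types on synthetic data.
# scikit-learn is an assumption here; the lecture names no library.
import numpy as np
from sklearn.datasets import make_classification, make_blobs, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

# Classification: labeled yes/no answers, so the algorithm can learn from them.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)
print("classification:", clf.predict(X[:3]))

# Clustering: no labels at all -- the algorithm must find the groups itself.
X_blobs, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_blobs)
print("clustering:", km.labels_[:10])

# Regression: the target is a numeric value, e.g. a product price.
X_reg, y_reg = make_regression(n_samples=200, n_features=4, random_state=0)
reg = LinearRegression().fit(X_reg, y_reg)
print("regression:", reg.predict(X_reg[:3]))

# Ranking: approximated by sorting items by a predicted score, the way a
# streaming service might order its movie recommendations.
ranked = np.argsort(reg.predict(X_reg))[::-1]
print("ranking (top 5 item indices):", ranked[:5])
```

The contrast to notice is that classification and regression learn from labeled answers, while clustering receives no labels and must discover the groups on its own.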
Then a very important thing is that we have to establish data collection mechanisms. While establishing data collection mechanisms, we should know what is required of them, because we know that data collection and data preparation are a very important part of machine learning.

The first mechanism is the data warehouse with ETL, that is, depositing data in warehouses. We know a data warehouse is a very large storage area where we store data. These storages are usually created for structured records, that is, SQL (structured query language) records, meaning records that fit into standard table formats. Why a standard table format? Because it is safe to say that records such as sales records, payroll records, or customer relationship management data are exactly what gets stored in data warehouses. Another traditional attribute of dealing with warehouses is transforming data before loading it there. Later we will talk more about data transformation techniques, that is, transforming data from one structure to another, but generally it means that you know which data you need and how it must look, so you do all the processing before storing. This approach is called extract, transform and load, or ETL. That is why, while establishing data collection, the data warehouse and ETL are among the most important mechanisms in machine learning.

Next is the data lake with ELT. What is a data lake? Data lakes are storages capable of keeping both structured and unstructured data, including images, videos, sounds, records, PDF files, everything. You get the idea: only when we come to use this data do we decide how it will be prepared for the machine learning algorithm. Even if the data is structured, it is not transformed before storing; remember, you load data there as-is and decide how to use and process it later, on demand. This approach is called extract, load and then, when you need it, transform, which is why it is abbreviated ELT.

Another point here is the human factor. Data collection may be a tedious task that burdens your organization with instructions. If people must constantly and manually make records, the chances are they will consider these tasks as yet another bureaucratic chore and let the job slide. For instance, Salesforce provides a decent tool set to track and analyze sales activities, but manual data entry and activity logging are alien to many salespeople, meaning it is very difficult for them. So some data is still handled through the human factor, and it needs this extra attention.

Then check your data quality. After collecting the data, the important question is whether your data really meets the standards, whether it really is quality data. How do we check this?

First, how tangible is the human error? If your data is collected or labeled by humans, check a subset of the data and estimate how often mistakes happen; this is very important.

Next, were there any technical problems when transferring data? For instance, the same records can be duplicated because of a server error, or you had a storage crash, or maybe you experienced a cyber attack. Evaluate how these events impacted your data. This is very important, because we have to check data quality before making a data set public.

Next, is your data adequate for your task? Yes, we have to check whether the data is adequate to perform the particular task on which the machine learning model is going to be built.

Next, is your data imbalanced? Of course, before transforming the data we have to check whether it is imbalanced or not.

Then format the data to make it consistent, checking whether your data is consistent or inconsistent. Take up the data cleaning process, which is very important. Then decompose the data where required, and rescale the data, that is, assess the data in its various forms, whether it is structured or unstructured, labeled or unlabeled, and only then decide to release those data sets to the public as public data sets.

These are the ways to check your data quality, and only then can we use that data in a public environment to run a machine learning algorithm. That is why preparing data to build an effective machine learning model really depends on following all of these procedures. A small sketch of some of these quality checks appears below.
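As an illustration of the quality checks above, here is a minimal sketch using pandas on a small, made-up customer table; the column names customer_id, age and churned are assumptions, not part of the lecture.

```python
# A minimal sketch of basic data quality checks with pandas.
# The table and its column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, 41, 41, None, 29],
    "churned": [0, 0, 0, 0, 1],
})

# Technical problems: the same record can be duplicated by a server error.
print("duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()

# Human error often shows up as missing or mistyped values.
print("missing values per column:")
print(df.isna().sum())

# Is the data imbalanced? Compare the class frequencies of the target.
print("class balance:")
print(df["churned"].value_counts(normalize=True))
```

If the churned class were, say, one percent of the rows, the data would be heavily imbalanced and would need attention before any transformation.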
Next we have to think about what data preparation is. While preparing the data we should check the data types, the data requirements, the data errors and the data complexity. Implementing a machine learning algorithm requires the data to be numeric, which falls under data types. Some machine learning algorithms impose requirements on the data, which falls under data requirements. Some data carries statistical noise and errors that need to be corrected before transforming the data, which falls under data errors. And some complex non-linear relationships may arise out of the data, which falls under data complexity. These are the things we need to consider during data preparation.

Next, there are some standard data preparation techniques. One is data cleaning, that is, identifying and correcting mistakes or errors in the data. Feature selection means identifying those input variables that are most relevant to the task. Data transforms change the scale or distribution of variables. Feature engineering derives new variables from available data, that is, from existing data. And dimensionality reduction creates compact projections of the data. These are the things we have to remember; a short sketch of these techniques appears at the end of this part. That is why I keep telling you all these points about maintaining data quality and following the standard data preparation techniques.

After that we collect the data in its different forms as authenticated data. Authenticated data can be collected from sources like data.gov.in, Kaggle or the UCI data set repository. Then we have to state what kind of data we need to execute the task or to carry out the research, because when we decide on the particular data for the task, good data ensures that the results of the model are valid and can be trusted. That is why it is very important to decide what kind of data we need while performing the task or the research.

Then prepare the data in the different ways: the data can be prepared either manually or through an automatic approach, and it can be prepared in numeric form, which will make the model's learning faster. For example, an image can be converted to a matrix of n-by-n dimensions, where the value of each cell indicates an image pixel. In that way we can prepare the data manually or through an automatic approach.

So inputs are very important. Once the data is collected and prepared, check whether the data quality is really good and whether it is authenticated, because this is the data that is going to be the input to your machine learning algorithm. For this, conversion algorithms are needed; once we apply the conversion algorithms, which carry out the heavy computation, we arrive at accurate data. That accuracy is the actual outcome from the given input. For example, if we have collected data through sources like Twitter comments, audio files and video clips, this data should be converted, and we obtain the accuracy after transforming the data. That is why the input is very important to the machine learning algorithm.
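Here is the promised sketch of the standard preparation techniques, using pandas and scikit-learn on a made-up table; the columns, the tiny data set and the library choice are all assumptions.

```python
# A minimal sketch of the five standard preparation techniques:
# cleaning, feature engineering, transforms, selection, and
# dimensionality reduction. The data set is hypothetical.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "height": [1.7, 1.8, np.nan, 1.6],
    "weight": [70, 82, 77, 54],
    "age": [30, 45, 38, 25],
    "label": [0, 1, 1, 0],
})

# Data cleaning: correct mistakes or errors -- here, fill a missing height.
df["height"] = df["height"].fillna(df["height"].mean())

# Feature engineering: derive a new variable from existing data.
df["bmi"] = df["weight"] / df["height"] ** 2

# Data transform: change the scale of the variables.
X = StandardScaler().fit_transform(df[["height", "weight", "age", "bmi"]])
y = df["label"]

# Feature selection: keep the input variables most relevant to the task.
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Dimensionality reduction: a compact projection of the data.
X_compact = PCA(n_components=2).fit_transform(X)
print(X_selected.shape, X_compact.shape)
```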
Then process the data. Once we give the input, there are many algorithms and machine learning techniques available to perform the required instructions, and they handle a large volume of data with accuracy and optimal computation. This processing is what yields the real outcome of the machine learning algorithm.

The output we get should present the machine learning results in a meaningful manner, and the output can be in the form of reports, graphs and videos. Data analysts are very much interested in outputs in the form of reports, graphs and videos, because they want to analyze the data from this output.

Then finally we think about the storage, in which the obtained output, the model data and all the useful information are saved for future use. This storage may be a data warehouse, where a large amount of data can be stored for future use, and we can make that data available as a public data set.

After studying this, there is a question on this topic: which of the following is an example of feature extraction? Feature extraction is a part of data collection as well as data preparation. The options are: first, constructing a bag-of-words vector from an email; second, applying PCA (principal component analysis) to project large high-dimensional data; third, removing stop words from a sentence; and fourth, all of the above. What is the answer to this question? The answer is all of the above, because when we use feature extraction in data preparation, we first construct the bag-of-words representation from the source text, then we use principal component analysis to reduce the high-dimensional data, and we remove the stop words, the words that carry little meaning, from the sentence. All three options are correct, so the answer is all of the above. A small sketch of these three steps follows. Thank you.
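As promised, here is a minimal sketch of the three feature extraction steps from the quiz, using scikit-learn as an assumed library; the sample email texts are made up.

```python
# A minimal sketch of the three feature extraction steps from the quiz.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

emails = ["win a free prize now", "meeting notes for the project review"]

# Options 1 and 3: build a bag-of-words vector while removing English
# stop words (scikit-learn's built-in stop word list).
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(emails).toarray()
print("vocabulary:", vectorizer.get_feature_names_out())

# Option 2: apply PCA to project the high-dimensional vectors down.
X_low = PCA(n_components=1).fit_transform(X)
print("projected shape:", X_low.shape)
```

All three steps turn raw text into compact numeric features, which is exactly what feature extraction means in data preparation.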