In the previous video, we saw what missing values are and what the types of missingness are, and, based on that, what we can do: either delete the affected records or impute them. In this video lecture, we will see what outliers are, some details about outliers, some details of encoding techniques, and some details of scaling, that is, data transformation.

To start with, what is an outlier, and what are the types of outliers? Outliers are generally classified into three categories: global outliers, collective outliers, and contextual outliers.

Global outliers are the simplest form of outlier in a given data set: a data point that strongly deviates from all the rest of the data points is called a global outlier. Take this example; the point is clearly an outlier. Say a student writes all his exams with the right hand, but in one instance we record it as the left hand. There is a chance that this is an outlier caused by some error in data entry. This is one example of a global outlier, and global outliers are easy to find.

Next there is something called a collective outlier: some of the data points, taken as a whole, deviate significantly from the rest of the data set. For example, in this figure there is a set of points which together form an outlier. Individually each point may look normal, but as a group they deviate significantly from the rest of the data. You can see it from the figure; this is called a collective outlier.

One more type is the contextual outlier. What is a contextual outlier? Data that deviates significantly from the other data points only under a specific context or condition. For example, in the Northern Hemisphere summer starts in May, and you observe the corresponding temperature values. If you see a sudden drop in June instead of the expected warm value, you can make out, based on the context, that this is inconsistent data and hence an outlier. This is called a contextual outlier.

So we learnt three types of outliers: the global outlier, where a single point deviates from everything else; the collective outlier, where a group of data points deviates significantly; and the contextual outlier, where, based on the context, we can make out that there is an inconsistency in the data.

How do we identify an outlier? One of the ways is using box plots. In a box plot we have a minimum value, a maximum value, and a median, and from the data we also get the third quartile and the first quartile. The range between the third and the first quartile is the interquartile range. I will not go through the formulas for calculating a box plot; I assume you already know them, because they were discussed in the previous lectures.

Let me take an example. We have seven values, 0 to 6. If I represent them in a box plot, it looks like this: 0 is the minimum value and 6 is the maximum value, the median is 3, the third quartile is 4.5, and the first quartile is 1.5. Similarly, you can do it for a few other cases: here is one set of data, a second set of data, and a third set of data.
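If you want to check the quartiles of the first set programmatically, here is a minimal Python sketch. Note that several quartile conventions exist; numpy's default linear interpolation happens to reproduce the values quoted above.

```python
import numpy as np

# First data set from the example: the seven values 0 to 6.
data = np.array([0, 1, 2, 3, 4, 5, 6])

# numpy's default interpolation reproduces the lecture's values:
# Q1 = 1.5, median = 3.0, Q3 = 4.5.
q1, median, q3 = np.percentile(data, [25, 50, 75])
print(q1, median, q3)  # 1.5 3.0 4.5
```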
You can pause the video, work them out, and find out which of the three sets contains an outlier. Now we will discuss each in detail.

In the first case we have the data 0 to 6. As you can see, 0 is the minimum number and 6 is the maximum number, so the whiskers of the box plot run from 0 to 6. If you apply the formula and calculate, 3 is the median, 1.5 is the first quartile, and 4.5 is the third quartile, and the entire data falls within the box plot.

Now we go to the second case, where the data runs from 0 to 9: again 0 is the minimum number and 9 is the maximum number. The median is again 3, with 1.5 as the first quartile and 4.5 as the third quartile. You can see that the entire data, including 9, is still represented within the box plot; if you apply the formula, everything falls inside it.

Now let us consider the third case. Is any data point in this set an outlier? We will see. If you plot it, there is one point which is an outlier, which is nothing but 10. The rest of the data is plotted with respect to 0 to 5: 0 is the minimum value, 5 is the maximum whisker value, the median is around 3, the third quartile is around 4.5, and the first quartile is around 1.5. So 10 is an outlier.

How did we get that? You take the interquartile range: the third quartile is 4.5 and the first quartile is 1.5, so the range is 3. Multiply that by 1.5 and you get 4.5. If you add this 4.5 to the third quartile, you reach 9. That is why the previous data set, which contained 9, still plotted within the box plot, while here 10 plots outside it: it is above the limit of 9, hence an outlier. Similarly, an outlier can exist below as well, beyond the first quartile minus 1.5 times the interquartile range; that case is also possible. So one of the ways to find an outlier is using a box plot: plot the data, and any point beyond the whiskers is an outlier. We will see a small code sketch of this rule shortly.

Now we will go ahead and understand what categorical data is. There are two types of data: quantitative data and qualitative data. Quantitative data are numerical, while qualitative data are generally strings, or categories. Qualitative data, that is, categorical data, can be divided into nominal and ordinal. What do we mean by nominal and ordinal? Nominal data are just categories with no order. Ordinal data are also categories, but with an implied order. On the quantitative side, numerical data can be either discrete or continuous: discrete takes particular numbers, continuous takes any numerical value. This you already know.

Coming to categorical data: you know it is divided into categories which are generally represented as strings. How will the computer understand this? We have to convert it to a numerical format for processing. What is ordinal data? Ordinal data are categories which have an inherent order, for example educational qualification: high school, bachelors, masters, PhD. There is an order.
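Before we continue with ordinal and nominal data, here is the promised sketch of the 1.5 × IQR rule, applied to the third data set; I am assuming, as in the walkthrough, that it consists of the values 0 to 5 plus the value 10.

```python
import numpy as np

# Third data set (an assumption matching the lecture: 0..5 plus 10).
data = np.array([0, 1, 2, 3, 4, 5, 10])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1            # 4.5 - 1.5 = 3.0
lower = q1 - 1.5 * iqr   # 1.5 - 4.5 = -3.0
upper = q3 + 1.5 * iqr   # 4.5 + 4.5 =  9.0

# Any point outside [lower, upper] is flagged as an outlier.
print(data[(data < lower) | (data > upper)])  # [10]
```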
Coming back to ordinal data: if the order itself is something we should consider as a feature in our model, then the variable falls under ordinal. Now take a classification task over animals: lion, tiger, cats, dogs, and so on. There is no inherent order there, so in that case the data is nominal.

Now, how do we convert the given data into a numerical format so that the method can use it for processing? This type of conversion is called encoding categorical data: categorical encoding is the process of transforming categorical data into numerical data.

Why is it important? The performance of a machine learning model depends not only on the model and the hyperparameters but also on how we process and feed the different types of variables to the model. Since most machine learning methods accept only numerical variables, preprocessing the categorical variables becomes a very necessary step. This is the main reason why we do the encoding.

There are various ways to do the encoding; we will learn only a few. The first one is label encoding, or ordinal encoding. As the name suggests, it is for ordinal categorical data: categories that have an inherent order. You have primary school, then high school, then bachelors, then masters, and maybe after that a PhD; there is an inherent order. If there is an inherent order and you are using it in your data, then it is better to do ordinal encoding, which is also known as label encoding. What do we do here? We just convert the given categories into integer values: primary school is 1, high school is 2, bachelors is 3, masters is 4. This type of conversion is called label encoding or ordinal encoding. It is very easy, and it is very informative as well.

Now the question is: since it is so easy, can we use label encoding for a categorical variable which is nominal, that is, non-ordinal in nature? The answer is no. Why? Because it imposes an order. If you are classifying into cats and dogs and you use label encoding, it conveys that one is superior to the other. Since we do not want that, we generally do not use label encoding there.

What to do in that case? There are various encoding techniques used for nominal data, but we will discuss a few. The first one is one-hot encoding. What does one-hot encoding do? Let us take this example: an attribute origin which contains USA, Japan, and Europe. Since we are not worried about ordinality here, it is just information. With 3 distinct categories we create 3 columns: USA is 1 0 0, Japan is 0 1 0, and Europe is 0 0 1. Now if you change this, say USA gets 0 0 1, Japan stays 0 1 0, and Europe gets 1 0 0, it still works; you just have to give the 3 categories 3 distinct codes. If there are 3 categories you use 3 columns, 1 0 0, 0 1 0, and 0 0 1, and the assignment can be anything: Japan 1 0 0, Europe 0 1 0, USA 0 0 1 will work just as well. However many categories there are, that many columns you will create and fill, as the sketch below shows.
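As an illustration, here is a minimal pandas sketch of both encodings; the column names and the exact category-to-integer mapping are my own choices, made to match the examples above.

```python
import pandas as pd

df = pd.DataFrame({
    "education": ["high school", "masters", "bachelors", "primary school"],
    "origin": ["USA", "Japan", "Europe", "USA"],
})

# Label / ordinal encoding: map each ordered category to an integer.
order = {"primary school": 1, "high school": 2, "bachelors": 3, "masters": 4}
df["education_encoded"] = df["education"].map(order)

# One-hot encoding: one binary column per distinct category of origin.
df = pd.get_dummies(df, columns=["origin"], dtype=int)
print(df)
```

Note that the integer codes for education deliberately carry the order, while the three origin columns do not.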
So how does it happen? In one-hot encoding, for each level of a categorical feature we create a new variable. Each level is mapped to a binary variable containing either 0 or 1, where 0 represents the absence and 1 represents the presence of that category. That is all.

What are the demerits of one-hot encoding? There are two. One is that it introduces sparsity in the dataset; the other is that it creates dummy features without adding much information, which is called the dummy variable trap. What do we mean by that? To understand it, suppose you want to classify all the living organisms present on earth, considering all the class labels. That is your problem. The number of categories n is very huge, and if you apply one-hot encoding you will have n columns. The problem with this is that each row has very few ones and too many zeros, so the encoding introduces sparsity in the dataset. When you have this type of data, where the number of categories is very large, we generally avoid using one-hot encoding.

What can we use instead? There are various other methods as well; I will be discussing binary encoding. Binary encoding is a mixture of hash encoding and one-hot encoding. What does it do? Say you have n categories, for example hot, cold, very hot, and warm, with values repeating, so there are 4 distinct ones. First you convert each category into some integer value. From that integer you convert to binary. Once you have the binary form, each binary digit is spread into its own column, one-hot style: the integer 2 becomes 010 and 3 becomes 011. This type of encoding is called binary encoding. Note that simply dropping a dummy column from one-hot encoding does not reduce the complexity much: with n categories you would still have n−1 columns, whereas binary encoding needs only about log2(n) columns. That is why we use binary encoding, which does both steps: integer value, then binary, then a one-hot-style spread of the binary digits. A small sketch of this appears at the end of this section.

We will go to the next topic, that is, data transformation. Here you are transforming the given data into something else, and one of the best examples is data scaling. How do we do data scaling? To understand that, we first need to understand why we have to do it at all. By data here we mean attribute values, and we often use the word feature. Attributes and features are closely related: an attribute which contributes significantly to the task is what we call a feature. That is the general usage; there are various formal definitions as well.

So why do you have to change an attribute, why do you have to do some scaling or transformation on the obtained values? That is the question. Let us take this example: a data preprocessing course offered to a set of learners. The age group is 27 to 50; we record the time spent, that is, how much time each learner spent to learn the course; and every course lecture is assigned a difficulty level, like easy, medium, and hard, or novice, intermediate, and expert. Some three class labels are there.
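As promised above, here is a minimal sketch of binary encoding written out by hand, using the temperature categories from the example. The category-to-integer mapping is arbitrary, and in practice a library such as category_encoders provides a ready-made BinaryEncoder.

```python
import pandas as pd

temps = pd.Series(["hot", "cold", "very hot", "warm", "hot", "warm"])

# Step 1: map each distinct category to an integer (order is arbitrary).
codes = {cat: i + 1 for i, cat in enumerate(temps.unique())}
ints = temps.map(codes)  # hot=1, cold=2, very hot=3, warm=4

# Step 2: write each integer in binary, padded to a fixed width
# (4 categories need only 3 bits instead of 4 one-hot columns).
width = int(ints.max()).bit_length()
bits = ints.apply(lambda v: list(format(int(v), f"0{width}b")))

# Step 3: each binary digit becomes its own 0/1 column.
encoded = pd.DataFrame(
    bits.tolist(), columns=[f"temp_{i}" for i in range(width)]
).astype(int)
print(encoded)
```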
Now from this data you have to see what the result is, that is, whether there is any correlation with the result or not. You cannot apply it to the method directly. Why? Because age ranges between 27 and 50, while time spent is represented in seconds: 48,000 seconds is roughly 13 hours and 83,000 seconds is roughly 23 hours. So one learner spent around 13 hours to complete the course, another around 23 hours. Values like 48K and 83K are huge compared to the age, and similarly huge compared to the three class labels of the difficulty level, easy, medium, hard, or novice, intermediate, expert.

Most machine learning methods use distance metrics in their computation. If a method does, then with ranges like these the other variables will have no significant impact: effectively it only considers the time spent. We have to normalize this type of data in such a way that all the attributes contribute equally. That is why feature scaling, or data scaling, is required.

Before seeing how to do feature scaling, you should know that there are a few algorithms which do not use distance metrics; in those cases you can use the data directly and need not do feature scaling. But most algorithms do use distance-based computations in their methods, so in those cases we have to do feature scaling.

Now, how to do feature scaling? We can do it in two ways: normalization and standardization. What is normalization, or min-max scaling? We take each value and, knowing the minimum and maximum values of the attribute, generate a new value:

x' = (x - min) / (max - min)

This confines the data to a range; that is min-max scaling. What is standardization, or Z-scoring? It uses the mean and the standard deviation to calculate the new value:

x' = (x - mean) / standard deviation

We can use normalization or standardization based on our requirement, that is, based on the data. Now, what are the differences between normalization and standardization? There are quite a few. Normalization uses the minimum and maximum values of the feature for scaling, whereas standardization, as per the definition, uses the mean and standard deviation. Normalization is used when the features are on different scales, while standardization is used when we want to ensure zero mean and unit standard deviation. Also, normalization always scales into a range, a range you can define based on what you require, whereas standardization is not bounded to any particular range. And because the minimum and maximum are sensitive to extreme values, if you have not done outlier detection in your preprocessing steps it is generally not recommended to use normalization; it is better to use standardization. Normalization is useful when we do not know the distribution of the data, whereas if the data follows a normal, that is Gaussian, distribution we can directly use standardization. There are other names as well: standardization is also called Z-score normalization, whereas normalization is called min-max scaling. These are a few points on scaling.

With this I want to conclude this video with some open-ended questions. The first question: should we always scale our features?
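Before taking up these questions, here is a minimal sketch of both scaling methods using scikit-learn; the toy values are made up to echo the time-spent column from the course example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A made-up "time spent in seconds" column, one value per learner.
X = np.array([[48000.0], [83000.0], [60000.0], [55000.0]])

# Normalization / min-max scaling: squeezes values into [0, 1].
print(MinMaxScaler().fit_transform(X).ravel())

# Standardization / Z-score: zero mean, unit standard deviation.
print(StandardScaler().fit_transform(X).ravel())
```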
Coming back to the questions: one of the answers I already gave you: if your method does not depend on a distance metric, then scaling is not required. That is one. Apart from that, are there any other situations where we need not scale? The second question: is there any single best scaling technique? The third question: how do different scaling techniques affect different classifiers? And the fourth question: should we consider the scaling technique as an important hyperparameter of the model? We discussed hyperparameters before, I assume; otherwise we will be discussing them again.

Now, the thing is, this set of questions is valid for each and every step of data preprocessing. What do I mean by that? Take any major task or subtask, missing values, data cleaning, reduction, anything: do you really need to fill the missing values? Is there any single best method to do it? How do the different techniques affect different classifiers? And how much does the choice matter as a hyperparameter for the classification? Similarly, you can map these questions onto all the concepts we learnt, including encoding and outlier detection, and try to understand how best to find a solution for a given problem. If you try to answer these questions, you will get some idea of the various ways available and which one is best for your data or your problem statement. These are a few of the sources from which I took the material for these lecture videos.

Now to summarize: we saw the three types of outliers, global, collective, and contextual, along with their definitions and differences. Among them, we tried to understand how to find an outlier using a box plot, whether with just a box plot we can detect an outlier or not. After that we tried to understand why there is a need for data encoding. We saw that there are two types of data, qualitative and quantitative: one has only numbers and the other has strings or categories. When you have categories, you need to convert them into numbers, and that is called categorical encoding. Within categorical data there are two kinds: nominal and ordinal. For data that is nominal in nature we use something like one-hot encoding or binary encoding; for data that is ordinal in nature, that is, with an inherent order, we do label or ordinal encoding. For nominal data we saw one-hot encoding and the sparsity it can introduce: if the number of categories is too large, there will be too many zeros and only a few ones, so in that case we may not use it. Instead we used binary encoding, which is a mixture of two different encoding techniques: the data is converted to an integer value, then to binary, and then the binary digits are spread into one-hot-style columns. These are the two ways we encode nominal data. After that we saw one aspect of data transformation, that is, scaling. Scaling can be done using normalization and standardization.
Both have their own merits and demerits, and based on the data and the requirement we choose which scaling method to use. There are also chances that scaling may not be required at all; in those cases we do not scale.

To summarize the entire lecture series on data preprocessing: we learnt at an abstract level what data processing is and its types, namely data preprocessing and data postprocessing, and we saw why there is a need for data preprocessing. We understood a few complexity issues, time complexity and space complexity, whether they make an algorithm or method efficient or not, what the effectiveness of a solution is, and how errors occur and are handled. We also saw the major tasks in data preprocessing, such as data cleaning, data integration, data reduction, and data transformation.

We saw in detail what missing values are, why they occur, and the types of missingness: MNAR, missing not at random; MAR, missing at random; and MCAR, missing completely at random. If the data is missing completely at random we can directly delete it; for the other two cases we have to do some imputation, or collect the data in a better way.

We understood what an outlier is and the types of outliers, global, collective, and contextual, and we saw how a box plot is used to detect an outlier. We also saw that all categorical data needs to be converted to a numerical format, and to do that we have categorical data encoding. There are two types of categorical variables, nominal and ordinal. For nominal data we have various methods like one-hot encoding, binary encoding, and so on; for ordinal data the most popular one is ordinal encoding, or label encoding: when there is an inherent order, you encode the categories by converting them to integer values. And, as we saw, one-hot and binary encoding serve the nominal data. After that we saw one aspect of data transformation, scaling, with its two methods: standardization and normalization.

In these video lectures we saw only some basic concepts of data preprocessing, and at an abstract level why data preprocessing is required. This is not an in-depth study of why it is required, of all the various steps involved, and of every method available for each type of data. I hope you enjoyed the videos; in case of any queries, let me know. Thank you.