Welcome to the Learning Analytics Tools course. I am Dr. Ashwin T.S. from the Education Technologies Department, IIT Bombay. In this course, you have already learnt how to collect data and how to process it using various machine learning techniques to do clustering, classification, and so on. In this lecture, we are going to learn what data preprocessing is. Let us first understand, at an abstract level, what data processing means in general. You collected raw data, you processed it using various machine learning algorithms, and then you tried to extract some information out of it. From that information, we do some higher-level synthesis to extract knowledge. This is the general cycle: from data to information, and from information to knowledge. Data here means raw data; the raw data has to be processed to get information, and while processing you might have used machine learning algorithms from this course to extract useful information. However, it is not always true that raw data can be used as it is for processing. We often need to do something to make the data more useful, or to put it in a form where it can be used in an efficient way to extract information. This step is called data preprocessing. Similarly, it is also true that the extracted information cannot always be used directly to do the higher-order synthesis or to extract knowledge; in that case, we use data post-processing. So the general types of data processing are preprocessing and post-processing, and in this lecture we concentrate on data preprocessing. Now, what do we mean by data preprocessing?
One definition of data preprocessing is this: data preprocessing is a data mining technique used to transform the given raw data into a useful and efficient format. Another definition goes like this: data preprocessing refers to the manipulation or dropping of data before it is used, in order to ensure or enhance performance, and it is an important step in the data mining process. All of these steps come under data mining: data to information, information to knowledge. Converting the given raw data into useful data is the preprocessing step. Then we process the data and extract information, and from information we extract knowledge. If we are not able to extract knowledge directly, we process the intermediate results again; that is called post-processing. This entire pipeline comes under data mining. Now, there are a few terms here. One is "useful and efficient format". Useful for what? And what do we mean by an efficient format? The other definition speaks of "ensure or enhance performance". How do we do this, and what does it mean? To understand this, we will look at some computer science concepts at an abstract level, so that we know what is meant by a useful and efficient format, or by ensuring and enhancing the performance of a given algorithm or method. Since I used the word algorithm, let us see what an algorithm is. You all know that an algorithm is a step-by-step procedure to solve a given problem. It is a very simple idea: we write a step-by-step procedure to solve the problem. Now the question is: can I write an algorithm, on a computer, for any given problem and solve it? Is that possible?
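As a small illustration of transforming raw data into a more useful format, here is a minimal sketch in Python. The survey rows, field names, and cleaning rules are hypothetical, chosen only to show the idea of preprocessing, not taken from any course dataset.

```python
# Hypothetical raw survey rows: strings with stray spaces and a missing score.
raw_rows = [
    {"name": " Asha ", "score": "78"},
    {"name": "Ravi",   "score": ""},    # missing value
    {"name": "Meena",  "score": "91"},
]

def preprocess(rows):
    """Convert raw rows into a clean, uniform format before mining."""
    clean = []
    for row in rows:
        name = row["name"].strip()          # normalise whitespace
        score_text = row["score"].strip()
        if not score_text:                  # drop incomplete records
            continue
        clean.append({"name": name, "score": int(score_text)})
    return clean

print(preprocess(raw_rows))
```

After this step, every record has a trimmed name and a numeric score, so a downstream method can consume the data directly.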
The answer is that it is not always possible to write an algorithm for every given problem. There is a better way to understand this at a very high level: problems can be classified into three types. In the first type, for a given problem I can write an algorithm that runs within a bounded amount of time, say polynomial time (P stands for polynomial). For such a problem I can both solve it in polynomial time and, if someone hands me a candidate solution, verify in polynomial time, that is, in a limited, finite amount of time, whether that solution is correct. In the second type, I cannot write an algorithm that finds a solution in a reasonable amount of time, but given a candidate solution, I can still verify in polynomial time whether it is correct. In the third type, I can neither find a solution in a reasonable amount of time nor verify a given solution efficiently. So there are three cases: we can solve and verify efficiently, we can only verify efficiently, or we can do neither. The third case is where we really cannot write a practical algorithm, because it takes too much time; and there is one more constraint, called space, which we will discuss in the coming slides. So broadly, one case is that you can write a usable algorithm, and the other is that you cannot.
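The second category, where verifying a solution is easy even though finding one may be hard, can be illustrated with the classic subset-sum problem. This is only a sketch; the numbers and function name are invented for illustration.

```python
# Subset sum: finding a subset of numbers that adds up to a target is
# hard in general, but VERIFYING a proposed solution is quick.
def verify_subset_sum(numbers, candidate, target):
    """Check that `candidate` uses only items from `numbers` (with
    multiplicity) and that it sums to `target` -- a polynomial-time check."""
    remaining = list(numbers)
    for x in candidate:
        if x not in remaining:
            return False        # candidate uses an item not in the list
        remaining.remove(x)     # consume one occurrence
    return sum(candidate) == target

numbers = [3, 9, 8, 4, 5, 7]
print(verify_subset_sum(numbers, [3, 4, 8], 15))  # True
print(verify_subset_sum(numbers, [9, 9], 18))     # False: 9 appears only once
```

The verifier does a single pass over the candidate, so checking a solution is cheap even when searching for one is not.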
So what happens when we cannot write such an algorithm? We write something else: intelligence-based methods, such as artificial intelligence techniques. Artificial intelligence can be one solution for this type of problem, where you use machine learning or deep learning methods. It may not be possible to find the optimum solution in the given time, but you can still try to find a good, near-optimal solution for the problem. These deep learning methods are described by architectures, and you have seen many such architectures; there are also several machine learning methods and frameworks we can use to attack problems for which we cannot write an algorithm that solves the entire problem in a given amount of time. Now, moving to the next part: even if we can write an algorithm for a given problem, can we really use it all the time? To understand that, we need to know one terminology here: complexity, and complexity is measured in terms of time and space. What do we mean by complexity? To understand complexity, you need to look at the amount of input data given to the algorithm and ask whether the algorithm's cost depends on that input size. For example, if the input data has size n and the algorithm takes time proportional to n to solve the problem, the growth rate is linear: the running time depends directly on the input size. There are also cases where the running time does not depend on the input size at all; that is called a constant growth rate. Similarly, we have other growth rates: logarithmic, linear, quadratic, exponential, and factorial. Constant time, as you understood, does not depend on the input data.
In the logarithmic case, the running time grows logarithmically with the input size; in the linear case, it grows linearly; and then we have the quadratic, exponential, and factorial cases. Now, even though you can write an algorithm, can you really use it on very large data when the algorithm takes that much time? In the quadratic, exponential, or factorial cases, with a huge amount of data, the algorithm may take years to finish. Say, for example, you want to classify all living organisms, and you are using image data; the data will be enormous, and your algorithm may take an impractically long time to solve the problem. So even when we can write an algorithm, we may not use it in cases where the data is very huge and the growth rate is quadratic, exponential, or factorial. So far I have used the word time; similarly, there is something called space. For any given input, an algorithm also stores data in the computer. If the data itself is very large, or if the space required for the computation grows quadratically or exponentially with the data, even when the computations themselves are cheap, we may again not be able to use that algorithm. Why? Because if the data is huge, the algorithm takes a huge amount of time, and that much time we cannot provide; that is one case.
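To see how quickly these growth rates diverge, here is an illustrative sketch that tabulates the step counts for each growth rate on a few small input sizes. The numbers are purely illustrative operation counts, not measured running times.

```python
import math

# Step counts for the common growth rates at input size n.
# Factorial and exponential are shown only for tiny n -- they explode.
def growth(n):
    return {
        "constant":    1,
        "log":         math.ceil(math.log2(n)),
        "linear":      n,
        "quadratic":   n ** 2,
        "exponential": 2 ** n,
        "factorial":   math.factorial(n),
    }

for n in (4, 8, 16):
    print(n, growth(n))
```

Even at n = 16, the factorial count is already in the trillions, which is why such algorithms become unusable long before the data is "big" in any modern sense.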
The second case is about space: you have a huge amount of data, and the machine has its own limitation; it cannot store that much data. In that case also we avoid using such algorithms. Now, why do we need to understand all this? Because of two notions: efficiency and effectiveness. What do we mean by efficiency? An algorithm solves a given problem efficiently if it takes a reasonable amount of time and a reasonable amount of space. Now the question is: can you increase the efficiency of an algorithm that is quadratic in nature by doing some modification to the raw data? That is, by changing the form of the raw data, can a quadratic algorithm become linear, or can you write a new algorithm that solves the same problem on the modified data but runs in linear time? This is where data preprocessing comes into the picture, and this is what efficiency is about. Efficiency concerns both time and space, so in the definition you saw, "efficient format" always refers to time and space: we process the data so that it becomes more suitable for the given method. Similarly, we can improve effectiveness. Effectiveness is about accuracy: how accurate is your method, or your algorithm, for the given problem? Can I do some modifications to the raw data which preserve the same information, such that a method applied to that processed data becomes more accurate?
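The idea of turning a quadratic method into a linear one by reshaping the raw data can be sketched concretely. The example below checks a hypothetical list of student IDs for duplicates: the naive version compares every pair (quadratic time), while preprocessing the data into a hash set makes the same check linear on average. The IDs are invented for illustration.

```python
# Naive method on the raw list: compare every pair -- O(n^2) time.
def has_duplicates_quadratic(ids):
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if ids[i] == ids[j]:
                return True
    return False

# After preprocessing into a set, each ID is inserted and looked up in
# (average) constant time, so the whole check is O(n).
def has_duplicates_linear(ids):
    seen = set()
    for x in ids:
        if x in seen:
            return True
        seen.add(x)
    return False

ids = ["s01", "s02", "s03", "s02"]
print(has_duplicates_quadratic(ids), has_duplicates_linear(ids))
```

Both functions answer the same question, but the second scales to large data because the preprocessed structure supports fast membership tests.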
So effectiveness is about accuracy, or more generally about performance measures such as accuracy, precision, recall, and F1 score, the performance metrics which you learnt in this course. These, too, we can optimize through preprocessing; this comes under effectiveness. We will gradually see what efficiency and effectiveness mean, with examples, as we go through the course. Apart from making the data useful, efficient, and effective, what else can preprocessing address? What other challenges make us use data preprocessing? One more dominant reason is errors. When we collect raw data, there is a chance that the data is incomplete; in those cases also we have to use data preprocessing. A simple example of an error is a missing value: you are collecting the data and some information is not stored. How do such errors happen? To understand that, we need to know the types of errors. There are two: human error and system error. What do we mean by human error? You are collecting data from a human, and there might be a mistake while entering it. What kinds of mistakes can happen? One is a slip. A slip means you know what you have to fill in, but you did not do it correctly at that point of time. For example, while entering your email ID, you accidentally miss the "@" symbol, or you accidentally miss the dot before "com". These are slips: you know the answer, but you fail to enter it correctly at that moment.
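The effectiveness metrics named above can be computed by hand from a confusion matrix. Here is a small worked sketch on a hypothetical binary classification result (1 marks the positive class); the labels are invented for illustration.

```python
# Hypothetical ground truth and model predictions (1 = positive class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion-matrix counts.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy  = (tp + tn) / len(y_true)          # fraction of correct predictions
precision = tp / (tp + fp)                   # of predicted positives, how many are real
recall    = tp / (tp + fn)                   # of real positives, how many were found
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Preprocessing aims to move these numbers up for the same learning method, by feeding it cleaner, better-shaped data.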
A mistake, by contrast, is when you do not know what exactly to enter, but you believe your entry is correct and you enter it anyway. Apart from these two, there is something called a violation. What do we mean by violation? Basically, if you are not willing to provide the data, it falls under violation. For example, you do not want to enter your salary in a given Google form; that is a violation, where you willingly withhold the information. So human error can be a slip, a mistake, or a violation. Similarly, system errors can also happen. What do we mean by system error? Say, for example, you collected the data on a 64-bit machine, which stores an integer differently from a 32-bit machine. You used all the bytes required to store an integer on the 64-bit machine, and when the data is moved to a 32-bit machine, some information may be lost. There are other ways, too, in which a system error can corrupt the data. So these are the two kinds of error: human error and system error. How do we handle errors, whether due to humans or systems? In that case also we need data preprocessing. So what we have learnt is that data preprocessing is required to make processing more efficient and effective, and it is also required when the data contains errors, which can be human errors or system errors. To summarize, we saw what data processing is at an abstract level and why it is required. Given raw data, it can be processed to get some information, and from that information we can extract knowledge. Before processing the raw data, we can do preprocessing to make the processing more efficient, and also more effective or robust.
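The 64-bit-to-32-bit system error described above can be sketched in a few lines. The helper below is a hypothetical illustration of what a 32-bit machine does when handed a value that needs more than 32 bits: it keeps only the low 32 bits, silently corrupting the number.

```python
def to_int32(value):
    """Simulate storing `value` as a signed 32-bit integer: keep the low
    32 bits and reinterpret them as a signed number, the way a 32-bit
    machine would."""
    value &= 0xFFFFFFFF                       # discard bits above 32
    if value >= 0x80000000:                   # reinterpret as signed
        value -= 0x100000000
    return value

big = 10_000_000_000          # fits in 64 bits, not in 32
print(to_int32(big))          # a different number: information was lost
print(to_int32(42))           # small values survive unchanged
```

This is exactly the kind of silent corruption that preprocessing has to detect, for instance by range-checking values before conversion.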
We also do data preprocessing to handle errors. Similarly, we can do data post-processing, if required, when we convert information into knowledge, that is, when the higher-level synthesis or knowledge extraction needs it. So we have two types of data processing: data preprocessing and data post-processing. For data preprocessing, we saw at an abstract level why it is required: what an algorithm is, what the complexity of an algorithm is, whether we can really write an algorithm for any given problem, and, if we can, whether we can really use it. If the complexity, whether time complexity or space complexity, is too high, we may not be able to use the algorithm; but if we can process the given raw data into some format where the complexity is reduced, then the time or space required for the algorithm decreases, and we can run it in real time or as the problem statement demands. That is what complexity is about. From complexity we learnt what efficiency and effectiveness are, and we also saw the other factor that makes us use data preprocessing: errors. Errors can be human errors or system errors. Human errors are slips, where I know something but still did not do it right at that moment; mistakes, where I do not correctly know what I have to do; and violations, where I purposefully or willingly do not provide the proper information. System errors arise because the data we collect and store depends on the system we are using and its configuration. If any error exists due to the system, we can handle it using data preprocessing.
In the next lecture, we will see in detail the various steps in data preprocessing, the major tasks we perform in data preprocessing, and some of the general methods used in it.