Hello everyone. Today we are going to look at document preprocessing, that is, the creation of the logical view of the documents required in information retrieval. This lecture has been split into two parts: in this part we will see the first two steps, and in the next lecture the remaining steps. The learning outcome for this session is that students will be able to preprocess a document and generate the set of index terms.

So, let us first understand what information retrieval is. Information retrieval deals with the representation, storage, organization of, and access to information items. The user is given an interface where he or she can enter a query and access the particular information, collection, or documents of interest. To make this possible, we need document preprocessing.

Documents in the collection are represented using a set of index terms or keywords. These keywords can be extracted from the documents automatically or by human subjects, but since today's collections are too huge, and the databases too large, to process manually, it is of course done automatically. Whatever the process may be, what it provides us is the logical view of the document.

Now, of the words in a document, not all are equally significant for representing its semantics. Generally it is the nouns, or groups of nouns, that carry the most meaning and are used as representatives of the document content. Preprocessing the text of the documents in a collection is nothing but determining the terms to be used as index terms.

So, which words should be used? If we take the set of all the words in the collection to index documents, there will be too much noise. So what we have to do is control the size of the vocabulary, where the vocabulary is of course the collection of index terms. Preprocessing can thus be viewed as controlling the size of the vocabulary. But what generally happens while controlling the vocabulary size is this: a user knows that certain documents contain a particular keyword, and when that keyword is given as a query the user expects those documents to be retrieved, yet sometimes they are not. Why is this so? Because, to control the size of the vocabulary, search engines add some indexing steps, and due to these additional steps some documents, or rather their index terms, may be removed; that is why the user is surprised by the result. It is therefore that most search engines now use all the words in a text as index terms, so that the user gets the relevant results.

Now, this preprocessing has five steps: lexical analysis, which is nothing but separating the words and dealing with digits, hyphens, punctuation marks, and the case of letters; elimination of stop words, where we filter out words with very low discrimination value for retrieval purposes; stemming, where the words remaining after removing the affixes are taken as index terms; selection of index terms, since nouns frequently carry more semantics than adjectives, adverbs, verbs, articles, and prepositions; and construction of structures such as a thesaurus. In this lecture we are going to concentrate on the first two steps; a small sketch of the overall pipeline follows below.
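To fix ideas, here is a minimal sketch of this five-step pipeline in Python. All the function names and bodies are hypothetical stand-ins, since the lecture prescribes no fixed algorithm; each stage is only stubbed out so that the composition of stages, and the possibility of skipping stages, is visible.

```python
def lexical_analysis(text):
    # Stub: convert the character stream into a stream of words,
    # splitting on whitespace only; a real analyzer also handles
    # digits, hyphens, punctuation marks, and letter case.
    return text.lower().split()

def eliminate_stop_words(tokens):
    # Stub: filter out low-discrimination words (tiny illustrative list).
    stop = {"the", "of", "a", "and", "to", "or", "not"}
    return [t for t in tokens if t not in stop]

def stem(tokens):
    # Stub: a real stemmer would strip affixes (e.g. a Porter stemmer).
    return tokens

def select_index_terms(tokens):
    # Stub: a real selector might keep only nouns and noun groups.
    return tokens

def logical_view(text, stages):
    # Apply the chosen stages in order; skipping every stage after
    # lexical_analysis corresponds to full-text indexing.
    result = lexical_analysis(text)
    for stage in stages:
        result = stage(result)
    return result

# Full-text indexing: no stages beyond lexical analysis.
print(logical_view("To be or not to be", stages=[]))
# Selective indexing: stop word removal followed by stemming.
print(logical_view("To be or not to be", stages=[eliminate_stop_words, stem]))
```

Notice how the second call collapses the query "to be or not to be" to almost nothing; we will return to exactly this problem when we discuss stop words.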
This, then, is how the logical view of the document, that is, the set of index terms, is created.

A document contains text and structure. For a text retrieval model we are interested only in the text, whereas in a structured retrieval model, where structure is also specified in the query, the structure is recognized as well: there the output is the structure, and the text is then preprocessed. At any step the engine or the algorithm can decide which words are to be used as index terms. If, after removing spaces and accents, whatever remains is taken altogether as index terms, this is called full-text indexing. Alternatively, one can go further: remove the stop words, group the remaining words into noun groups, then do stemming, and finally perform automatic or manual indexing. Any step can be skipped, and the index terms are generated accordingly.

Now let us understand how lexical analysis is done. Lexical analysis is the process of converting the stream of characters into a stream of words. The first task is to identify, or recognize, the separators. Generally the space is the most common separator; commas or hyphens can also be used for separating the words, and the resulting words become index terms. The major objective of this analysis is the identification of the words in the text. However, we sometimes need to consider special cases, namely digits, hyphens, punctuation marks, and the case of letters, to separate out the words. Let us understand them one by one.

Generally, numbers are not good index terms: without the surrounding context they are inherently vague. Suppose I give a query to select the data from between the years 2020 and 2021. Then 2020 and 2021 will be taken as index terms for searching, but since these years occur in many unrelated documents, I may get other data besides what I am interested in. Still, it is not the case that we can simply skip all numbers. For example, sometimes I want to use a credit card number, mobile number, PAN card number, and so on for retrieval purposes, and if I do not store them as index terms, I will not be able to retrieve the data. For this reason, there are advanced lexical analysis procedures which perform date and number normalization into a uniform format.

The second case is the hyphen. As we have discussed, a hyphen can be removed and the separated words produced. For a term like "state-of-the-art", removing the hyphens makes little difference, but consider "mother-in-law": if we remove the hyphens, it loses its context. Or take "B-49": if this is the specific number of an item, splitting it destroys the identifier. So we have to decide at which points to treat a hyphen as a separator and at which points as part of the word; a small sketch of such decisions follows below.
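To make these decisions concrete, here is a minimal per-token sketch for digits and hyphens. The keep-list, the "year:" tag, and the normalize_token name are all illustrative assumptions; the lecture only says that such normalization exists in advanced analyzers.

```python
import re

# Hypothetical keep-list of hyphenated terms whose hyphen carries meaning;
# a real analyzer would derive this from a lexicon, not hard-code it.
KEEP_HYPHEN = {"mother-in-law", "b-49"}

def normalize_token(token):
    lower = token.lower()
    # Keep compounds and identifier-like codes (e.g. item "B-49") intact.
    if lower in KEEP_HYPHEN:
        return [lower]
    # Normalize four-digit years to an assumed uniform tagged format.
    if re.fullmatch(r"\d{4}", token):
        return ["year:" + token]
    # Otherwise treat the hyphen as a separator: "state-of-the-art" splits.
    if "-" in lower:
        return [part for part in lower.split("-") if part]
    return [lower]

print(normalize_token("state-of-the-art"))  # ['state', 'of', 'the', 'art']
print(normalize_token("mother-in-law"))     # ['mother-in-law']
print(normalize_token("B-49"))              # ['b-49']
print(normalize_token("2020"))              # ['year:2020']
```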
The same question arises with dots and other punctuation marks. Sometimes we treat them as separators, and sometimes the punctuation mark or dot is part of the word, in which case we should not remove it. Consider that we are searching through programs, say Java or C programs: dots are used within the code itself, and if we split on them we lose the context. So here too it has to be decided whether to keep the punctuation or remove it.

Now, generally, when we separate the words, they are all converted into lower case or into upper case. But consider Unix commands: they may be in a particular case or a mixture of upper and lower case, so converting everything into one case will not serve our purpose. Or consider the word "bank": with a capital B it generally refers to an organization, while in lower case it may refer to a river bank. So always converting into upper case or lower case will not solve our problem. But there are no fixed rules saying which choice must be made at which moment; there are many limitations. There is no fixed algorithm for lexical analysis; based on the application, we have to decide.

At this moment, pause the video and, based on what we have learned about lexical analysis, try to produce the set of words: you can decide your separators and then write down the set of words. Here I have treated commas and spaces as the separators, and these are the words which have been generated: "data", "retrieval", "consists", "mainly", "of", "determining", and so on. All the words have been generated as tokens using these separators. If you are using a programming language for the implementation, you can use a tokenizer class, or you can use NLTK, to process the text and generate these words.

Once these words have been generated, the next step is the elimination of stop words. Which words will be considered stop words? Words which are too frequent among the documents in the collection are not good discriminators: a word which occurs in 80 percent of the documents in the collection is useless for the purpose of retrieval. Why is this so? Because all those documents are going to be retrieved, and that will not give you the specific document or documents you are interested in. Such words are referred to as stop words, and these are normally filtered out as potential index terms. Generally, articles, prepositions, and conjunctions are the natural candidates for the list of stop words. Of course, when you remove all these words, the text gets reduced, so stop word elimination also provides compression of the text; and the list can be extended beyond articles, prepositions, and conjunctions.

But now consider the query "to be or not to be": if I remove all the stop words, what will remain? Only "be". So stop word elimination is not going to serve our purpose here. This is one more reason why search engines are adopting full-text indexing, so that the most relevant documents are given to the user.

So this is what we saw: lexical analysis generated the tokens, and after removing the prepositions, articles, and so on, these are the remaining terms which we are going to use for the further processing. A minimal NLTK sketch of both steps follows below.
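Since the lecture mentions tokenizer classes and NLTK, here is a minimal sketch of both steps using NLTK. The sample sentence echoes the one above; the exact output depends on NLTK's tokenizer model and its English stop word list.

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of the tokenizer model and the stop word list.
nltk.download("punkt")
nltk.download("stopwords")

text = "Data retrieval consists mainly of determining which documents contain the keywords."

# Step 1: lexical analysis -- character stream to word stream,
# dropping pure punctuation tokens and lower-casing.
tokens = [t.lower() for t in word_tokenize(text) if t.isalnum()]
print(tokens)

# Step 2: eliminate stop words (articles, prepositions, conjunctions, ...).
stop_set = set(stopwords.words("english"))
index_terms = [t for t in tokens if t not in stop_set]
print(index_terms)  # e.g. ['data', 'retrieval', 'consists', 'mainly', 'determining', ...]
```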
So, this is what we have seen as the first two steps. The next three steps will be covered in the next part of the lecture. Thank you.