Hello everyone. We are discussing document pre-processing, that is, the creation of the logical view of the document. In the first part we saw the first two steps; now we will look at the remaining steps. The learning outcome is the same: students will be able to pre-process a document and generate the set of index terms. As we have seen, what matters in document pre-processing is the generation of the set of index terms: we can either go for a more abstract view or use the full text as index terms. Lexical analysis and elimination of stop words are the two steps we saw in the last lecture, and now we are going to see stemming, selection of index terms and construction of the thesaurus. So, this is the example we have seen. This is the text we were given, and after lexical analysis these terms, or sub-words, are generated by removing the separators. Here the separators considered are space and comma. Then we removed the stop words: words which do not carry any meaningful information can be removed. In particular, articles, prepositions, conjunctions (the connecting words) and so on are removed, and these are the words we identified after elimination of the stop words. Now let us look at the next step, which is stemming. So, what is stemming? Frequently the user specifies a word in the query, but only a variant of this word is present in a relevant document, or vice versa. So, what is a stem? A stem is the portion of a word which is left after the removal of its affixes. We remove the prefixes and suffixes, and what remains is the stem. For example, consider connected, connecting, connection and connections: these are all variants of the word connect. So, once we have the stem connect, we can retrieve any document that contains any of these variations of connect.
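To make the connect example concrete, here is a minimal sketch of matching a query word against a document through a shared stem. The lookup table `STEM_TABLE` and the helper names `stem` and `matches` are hypothetical and hand-written for illustration only; they are not part of the lecture material.

```python
# Hypothetical, hand-written table mapping word variants to a shared stem.
STEM_TABLE = {
    "connected": "connect",
    "connecting": "connect",
    "connection": "connect",
    "connections": "connect",
}

def stem(word):
    """Return the stem from the table, or the lowercased word if it is not listed."""
    word = word.lower()
    return STEM_TABLE.get(word, word)

def matches(query_word, document_words):
    """True if the query word and at least one document word share a stem."""
    return stem(query_word) in {stem(w) for w in document_words}

doc = ["network", "connections", "dropped"]
print("connect" in doc)         # exact matching on the raw words
print(matches("connect", doc))  # matching on stems
```

With exact matching the query word connect does not find the document, because only the variant connections is present; matching on stems does find it.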
So, stems are thought to be useful for improving retrieval performance because they reduce the variants of the same root to a common concept. Otherwise, for the same document I would store all these four or five words; after stemming, one term stands for all five. Of course, this saves space: it reduces the size of the indexing structure, since the number of distinct index terms is reduced. However, there is controversy in the literature about the benefits of stemming. Now, there are many stemming strategies available, such as affix removal, table lookup, successor variety, n-grams and so on. One affix removal algorithm is the Porter algorithm. The Porter algorithm uses a suffix list for suffix stripping. The idea is to apply a series of rules to the suffixes of the words in the text. What kind of suffixes? For one thing, we convert plurals into the singular form, or if a word is in the past tense we convert it into the base verb form, and so on. Let us look at the rules and the examples. First, sses is replaced by ss; for example, caresses becomes caress. Then ies is replaced by i, so ponies becomes poni and ties becomes ti. Next, ss is replaced by ss, so caress stays as it is. And s is replaced by null; for example, cats is converted into cat. So, these are some of the rules. The second set of rules covers the past tense and the -ing form: feed stays feed, agreed becomes agree, and plastered is reduced to plaster. Motoring is converted to motor by removing ing, singing to sing, and so on. On the slide it should actually be singing, which gives sing. And, as in the earlier example, s is replaced by null, so cats becomes cat.
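The suffix rules above can be sketched in a few lines. This is a partial illustration covering only the plural rules and the eed/ed/ing rules of the Porter algorithm, not a full implementation: the later clean-up steps (for example restoring a final e or undoing doubled consonants) are left out, and the function names are my own.

```python
def is_consonant(word, i):
    """Porter's definition: y counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions (the m in Porter's rule conditions)."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def step_1a(word):
    """Plural rules: sses -> ss, ies -> i, ss -> ss, s -> (nothing)."""
    if word.endswith("sses") or word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word

def step_1b(word):
    """eed -> ee if m > 0; ed and ing are dropped if the remaining stem has a vowel."""
    if word.endswith("eed"):
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem if has_vowel(stem) else word
    return word

for w in ["caresses", "ponies", "cats", "agreed", "plastered", "motoring", "singing"]:
    print(w, "->", step_1b(step_1a(w)))
```

The vowel condition is what keeps short words intact: sing is left unchanged because removing ing would leave only s, while singing is reduced to sing.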
So, this is our example: after removing the stop words, these are the words we obtained. Now stemming can be applied to this data. Unfortunately there are no terms with a common stem here, but we will still apply the Porter algorithm and see which set of words we obtain. So, pause the video and write down some of the words after stemming, that is, after applying the Porter algorithm. Yes. So, for example, in documents or keywords the final s is removed by applying the algorithm, and these are the words we get after stemming. Next is index term selection. After getting these terms, which words should be adopted as index terms? With a full-text representation, all the words in the text are used as index terms. If you remember, in part 1 we saw a diagram: at any stage we can go for the creation of the index terms. The alternative is a more abstract view in which not all the words are used as index terms. In the area of bibliographic sciences, selection of index terms is usually done by a specialist. The alternative approach is to do it automatically, where some algorithm is used for selecting the index terms. So, which words will be the index terms? As we have already discussed, generally nouns are the index terms, but it is common to combine two or three nouns into a single component. For example, you have computer and science separately, but computer science is often needed as a single unit. So, we have to cluster nouns that appear together so that they act like a single word, that is, a single indexing component or concept.
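One way to cluster nearby nouns into a single indexing component is sketched below. It rests on two assumptions that are mine, not the lecture's: nouns are recognised with a small hand-written set `NOUNS` (a real system would use a part-of-speech tagger), and two nouns are grouped when the distance between them, counted in words, does not exceed a `threshold` parameter.

```python
# Hypothetical noun lexicon for illustration; a real system would tag parts of speech.
NOUNS = {"computer", "science", "information", "retrieval", "students", "department"}

def noun_groups(tokens, threshold=1):
    """Group nouns whose distance in the text (in words) is at most `threshold`."""
    groups, current, last_pos = [], [], None
    for pos, tok in enumerate(tokens):
        if tok not in NOUNS:
            continue
        # Start a new group when this noun is too far from the previous one.
        if last_pos is not None and pos - last_pos > threshold:
            groups.append(current)
            current = []
        current.append(tok)
        last_pos = pos
    if current:
        groups.append(current)
    return [" ".join(group) for group in groups]

tokens = "students of the computer science department study information retrieval".split()
print(noun_groups(tokens))
```

With `threshold=1` only directly adjacent nouns are merged, so computer, science and department form one component while students stays on its own. A larger threshold lets nouns separated by other words join the same group.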
So, a noun group is a set of nouns whose syntactic distance in the text, generally measured in number of words, does not exceed a predefined threshold. For example, suppose we have defined a threshold of three; then at most three words can be combined into one component. Take information retrieval: we often need information retrieval as a single unit when we want to retrieve information about this subject, its algorithms and so on. After identifying the index terms, the next step is the creation of the thesaurus. Before that, what is a thesaurus, and where does the word come from? The word comes from a word meaning treasury. A thesaurus is a pre-compiled list of important words in a given domain of knowledge, and for each word there is a set of related words. It involves normalization of the vocabulary and includes a structure that is more complex than a simple list of words and their synonyms. We know that when we lack some information, or do not know the right word, we take the help of a thesaurus, which suggests related words or synonyms that we can use. So, what is the use of the thesaurus? The main purpose is to provide a standard vocabulary, or a system of references, for indexing and searching. The second is to assist users in locating terms for proper query formulation. Sometimes I want to find some information but I am not able to form the query; I do not know which word to use, and at that time I can use the thesaurus for help. The third is to provide classified hierarchies that allow broadening and narrowing of the current query request according to the needs of the user. Suppose the user has written a long description; processing such a query is more complicated for the search algorithm.
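The broadening and narrowing of a query through a classified hierarchy can be sketched with a small dictionary. The entries in `THESAURUS`, the relation names `broader`, `narrower` and `related`, and the helper `reformulate` are all hypothetical and chosen only to illustrate the idea.

```python
# Hypothetical thesaurus entries, for illustration only.
THESAURUS = {
    "information retrieval": {
        "broader": ["computer science"],
        "narrower": ["indexing", "stemming"],
        "related": ["document search"],
    },
    "stemming": {
        "broader": ["information retrieval"],
        "narrower": ["affix removal"],
        "related": ["conflation"],
    },
}

def reformulate(term, direction):
    """Return the term plus its thesaurus neighbours in the given direction."""
    entry = THESAURUS.get(term, {})
    return [term] + entry.get(direction, [])

print(reformulate("stemming", "broader"))                # broaden the query
print(reformulate("information retrieval", "narrower"))  # narrow the query
```

A term that is not in the thesaurus is simply returned unchanged, so the query still runs even when no related words are known.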
So, sometimes we need to expand the query and sometimes we need to compress it to specific words, so that it is easier for the search algorithm. The motivation for building a thesaurus is based on the fundamental idea of using a controlled vocabulary for indexing and searching. In the first lecture we saw that deriving index terms is a task of controlling the size of the vocabulary, because as the text grows the size of the vocabulary also grows. So, why do we require this controlled vocabulary? What are the advantages? It gives normalization of the indexing concepts and reduction of noise. It identifies index terms with clear semantics, that is, clear meaning, and retrieval is based on concepts rather than on words. The main components of the thesaurus are its index terms, the relationships among the index terms, and a layout design for these relationships. Let us look at these one by one. First, the thesaurus index terms: a term is used to denote a concept, which is the basic semantic unit. The terms can be single words, groups of words or phrases, but most of them are single words, even though we also allow groups of nouns. The terms are basically nouns, because nouns are the most concrete part of speech. A term can also be in gerund form when it is used as a noun; for example, acting, teaching and so on are used as they are. We also have to decide whether to use the singular or the plural form; it is not always the singular, nor always the plural. Depending on the term, we decide which form is used.
Also, it is necessary to complement a thesaurus entry with a definition or an explanation. Why is this required? Because many words carry different meanings. For example, when we say seal there are two contexts: one is the marine animal and the second is documents. If we state whether seal refers to marine animals or to documents, it is easier to search and retrieve. Next, the thesaurus term relationships: once you have identified the terms, the terms that are related to each other have to be grouped together, mostly as synonyms or near-synonyms, so that searching is easier. Of course, relationships can also be induced from patterns of co-occurrence within documents. So, once we have identified the index terms and their relationships, how is this used in information retrieval? Generally, we know that when a writer needs more information about a word, he or she can take the help of a thesaurus. It is the same here when writing a query. Whenever a user wants to retrieve some documents, he or she first forms a conceptualization of what he or she is looking for. Because the collection is vast and the user is inexperienced, the user does not know which index terms to use to get proper results. Since the user is inexperienced, the first results may be erroneous or improper. When the user then takes the help of the thesaurus and goes through it, he or she is able to formulate or reformulate the query and get proper results. So, this is how the thesaurus is useful.
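The earlier remark that relationships can be induced from co-occurrence patterns can be sketched as follows. This is a minimal illustration under my own assumptions: each document is already reduced to a list of index terms, two terms are treated as related when they appear together in at least `min_count` documents, and the toy documents are invented around the seal example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count, for every pair of terms, the number of documents containing both."""
    counts = Counter()
    for terms in documents:
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts

def related_terms(term, counts, min_count=2):
    """Terms that co-occur with `term` in at least `min_count` documents."""
    related = []
    for (a, b), n in counts.items():
        if n >= min_count and term in (a, b):
            related.append(b if a == term else a)
    return sorted(related)

# Invented toy collection: each document is a list of index terms.
docs = [
    ["seal", "marine", "animal"],
    ["seal", "marine", "ocean"],
    ["seal", "document", "stamp"],
    ["seal", "document", "wax"],
]
counts = cooccurrence_counts(docs)
print(related_terms("seal", counts))
```

In this toy collection seal co-occurs twice with marine and twice with document, which reflects the two contexts of the word; terms seen together only once fall below the threshold.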
So, this is what we have seen in document pre-processing: the first step is lexical analysis, the second is elimination of stop words, the third is stemming, the fourth is selection of the index terms after all this pre-processing, and then we build the thesaurus, or a similar structure. Once this is done, we can go for searching or retrieval of the documents. Thank you.