Hello friends, today we are going to learn about the inverted index, which is an indexing structure. The learning outcome for this session is that students will be able to create an inverted index for a given text collection, and they will also be able to search that text collection using the inverted index they have created.
So let us see what an inverted index is; it is also called an inverted file. It is a word-oriented mechanism for indexing a text collection, and we know why we define an index: to speed up the searching task. The inverted index consists of two components: one is the vocabulary and the other is the occurrences. The vocabulary is the set of all different words, or keywords, occurring in the text, and the occurrences are the positions of the keywords that we have identified in the vocabulary. This position can be a character position or a word position.
So let us look at the example. Here these are all the character positions for the text. So this particular word "text" is occurring at position 90, and this word "words" is occurring at position 40. So when we talk about "words" as a word, this is its position here. Here, after removing the stop words, we have identified some of the words from the given text. These are the five words which we are going to take as an example.
The first step in building the inverted index is to identify the index terms, or words. These are the words which we have already identified. Then, in the second step, we have to sort them when we store them in the vocabulary. So these words, "letters", "made", "many", "text" and "words", are sorted and then stored in the vocabulary. Once we have identified the vocabulary, every word has to be associated with its positions. So in the third step we associate each word with its occurrences. These are all character occurrences.
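The three building steps described above (identify the index terms, sort the vocabulary, attach the occurrences) can be sketched in a few lines of Python. This is only a minimal sketch: the sample text and the stop-word list are assumptions, since the slide text is not part of this transcript, and the function name `build_inverted_index` is made up for illustration. The positions it produces for this assumed text need not match the ones read out from the slide.

```python
import re

# Assumed sample text and stop-word list; neither is given in the transcript.
TEXT = "This is a text. A text has many words. Words are made from letters."
STOP_WORDS = {"this", "is", "a", "has", "are", "from"}


def build_inverted_index(text, stop_words):
    """Map each index term to the list of its 1-based character positions."""
    index = {}
    # Step 1: identify the index terms (every word that is not a stop word).
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word in stop_words:
            continue
        # Step 3: associate the term with the position where it starts.
        index.setdefault(word, []).append(match.start() + 1)
    # Step 2: keep the vocabulary in alphabetical order.
    return dict(sorted(index.items()))


index = build_inverted_index(TEXT, STOP_WORDS)
for term, positions in index.items():
    print(term, positions)
```

A word that occurs twice, such as "text" in the assumed sample, simply ends up with two entries in its occurrence list.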
So "letters" is a word which is occurring at position 60, and so on. Now the word "text" is occurring twice here, 1 and 2, so there will be two positions. The same is the case with the word "words": it is occurring twice, so there will be two text positions, or two occurrences, for it. This is how we store character positions.
Now let us look at word positions. Here we find the position of every word; there are 14 words, and instead of the character position we store the word position. So these are the same words that we considered before, but the position is now the word position. That is the difference between the previous inverted index and this one.
Here the space required for the vocabulary is small compared to the text, because we are only identifying the index terms. According to Heaps' law, the vocabulary grows as O(n^β), where n is the size of the text and β is a value between 0 and 1; in practice it is between 0.4 and 0.6. The occurrences demand much more space than the vocabulary, around 30% to 40% of the text size. To reduce this space requirement we can use block addressing. If we store the index with exact positions, as we have already seen, it is called a full inverted index.
Now let us see block addressing. In block addressing the text is divided into blocks, and the occurrences point to blocks instead of exact positions. As we are storing block numbers, many occurrences will point to a single block, and if a word occurs several times in the same block, the number of occurrences stored is reduced. The only thing is that if you want the exact position, you first have to identify the block and then do a sequential search in that block. Now how do we define a block? A block can be defined with a fixed size, or it can be defined by the natural division of the text.
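The word-position variant and fixed-size block addressing described above can be sketched as follows. The sample text, the stop-word list, the block size of 4 words and the function names are all assumptions made for illustration; the block boundaries on the slide may be drawn differently.

```python
import re

# Assumed sample text and stop-word list; neither is given in the transcript.
TEXT = "This is a text. A text has many words. Words are made from letters."
STOP_WORDS = {"this", "is", "a", "has", "are", "from"}


def word_position_index(text, stop_words):
    """Map each index term to the 1-based positions of its words in the text."""
    index = {}
    words = re.findall(r"[a-z]+", text.lower())
    for pos, word in enumerate(words, start=1):
        if word not in stop_words:
            index.setdefault(word, []).append(pos)
    return dict(sorted(index.items()))


def block_index(text, stop_words, block_size):
    """Map each index term to the blocks it occurs in (fixed-size blocks)."""
    blocks = {}
    for term, positions in word_position_index(text, stop_words).items():
        # Word position p lies in block (p - 1) // block_size + 1.
        # Using a set collapses repeated occurrences inside one block.
        blocks[term] = sorted({(p - 1) // block_size + 1 for p in positions})
    return blocks


print(word_position_index(TEXT, STOP_WORDS))
print(block_index(TEXT, STOP_WORDS, block_size=4))
```

With these assumptions the two occurrences of "words" fall in the same block, so the block index keeps a single entry for that term, which is where the space saving comes from.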
For example, there can be b words in every block, or a block can end wherever a sentence is completed. Both methods have their advantages and disadvantages.
Here we have created four blocks from the given text, and the vocabulary is the same one which we have already identified. Now, instead of the exact position, that is, the word or character position, we have stored the block numbers. We can see that both occurrences of the word "words" are in the same block, and hence there is a single entry for it, that is 3, whereas the other entries represent the blocks of the other words. This is how we can use block addressing.
Now we have seen the inverted index for a given document. How do we do it for a text collection? Assume that these are your three documents. Pause the video at this moment and try to find the inverted index for every file separately first. So this is the inverted index for the first document, for the second one and for the third one. For every document some keywords are identified, and they have been associated with their occurrences. But when we give a query, that query will search all the documents in the collection; it will not search only a single document. So how do we create an inverted index for the complete text collection? What we have to do is combine all the keywords. If a keyword repeats, it is obviously taken only once in the vocabulary. Assume that after merging these are the keywords that we have identified; these keywords are then associated with their positions with respect to the documents. Here all these words occur in only one of the documents, which is why they have only a single entry. But the words "mount" and "mountain" each occur in two documents, so each of them points to two references. So when we search for a particular keyword, it is searched in the vocabulary, and then the respective documents and positions are retrieved.
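Merging per-document indexes into one index for the whole collection, as described above, can be sketched like this. The three documents, the stop-word list and the function names are hypothetical, since the documents on the slide are not part of this transcript; they are chosen only so that "mount" and "mountain" each appear in two documents.

```python
import re

# Hypothetical three-document collection and stop-word list.
DOCS = {
    1: "Mount Everest is the highest mountain",
    2: "Climbers camp below the mountain peak",
    3: "Mount Fuji rises above the lake",
}
STOP_WORDS = {"is", "the", "below", "above"}


def index_document(text, stop_words):
    """Inverted index of one document: term -> list of 1-based word positions."""
    index = {}
    for pos, word in enumerate(re.findall(r"[a-z]+", text.lower()), start=1):
        if word not in stop_words:
            index.setdefault(word, []).append(pos)
    return index


def merge_indexes(per_doc):
    """Combine per-document indexes: term -> {document id: positions}."""
    merged = {}
    for doc_id, index in per_doc.items():
        for term, positions in index.items():
            # A repeated keyword gets one vocabulary entry with several references.
            merged.setdefault(term, {})[doc_id] = positions
    return dict(sorted(merged.items()))


per_doc = {doc_id: index_document(text, STOP_WORDS) for doc_id, text in DOCS.items()}
collection_index = merge_indexes(per_doc)
print(collection_index["mountain"])  # {1: [6], 2: [5]}
print(collection_index["mount"])     # {1: [1], 3: [1]}
print(collection_index["lake"])      # {3: [6]}
```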
So this is how we build an inverted index for a given text collection. Once we have built the inverted index, how do we search it? These are the three steps for answering a given query. The first is the vocabulary search: the words or patterns given in the query are each searched in the vocabulary. The second is retrieval of occurrences: once a word is found in the vocabulary, its associated occurrences are retrieved. The third step is manipulation of the occurrences.
What does that mean? If it is a single word, we get the occurrences directly. But it may be a phrase query, a proximity query, or a query with Boolean operations, for example "information AND retrieval", where we want everything that has both "information" and "retrieval". That is a Boolean expression. Or I may want "information AND retrieval" but not "ICT" or "communication technology". If such queries are there, we have to do some manipulation. So in the third step, if you have a proximity, phrase or Boolean operation, we retrieve the occurrences for every word and then perform the operations on those occurrences. This is the manipulation of occurrences. Or, if you are using block addressing, we identify the blocks from the occurrences and then go for a sequential search in those blocks if anyone is interested in the exact position.
So let us look at this with an example. Here it is a single-word query. Our query is "many words". Single-word query means every keyword is searched separately. The keywords are "many" and "words", so both words are searched separately in the vocabulary and their occurrences are retrieved. For storing the vocabulary, any suitable data structure can be used, like a B+ tree, or we can use hashing, and based on that these occurrences are retrieved. So these are the occurrences; this is for a single-word query.
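The three search steps described above can be sketched with a small document-level index. The index contents, the document ids and the function names are hypothetical; a Python dict stands in for the hashing option mentioned for the vocabulary, and the Boolean step is done with plain set operations.

```python
# Hypothetical document-level index: term -> set of document ids.
INDEX = {
    "communication": {3, 4},
    "ict": {2, 4},
    "information": {1, 2, 3},
    "retrieval": {1, 2, 5},
}


def lookup(term):
    """Steps 1 and 2: find the term in the vocabulary, return its occurrences."""
    return INDEX.get(term.lower(), set())


def single_word_query(query):
    """Search every keyword of the query separately."""
    return {word: lookup(word) for word in query.split()}


def boolean_and_not(required, excluded=()):
    """Step 3: documents containing all required terms and no excluded term."""
    if not required:
        return set()
    # set.intersection returns a new set, so INDEX itself is never modified.
    result = set.intersection(*(lookup(term) for term in required))
    for term in excluded:
        result -= lookup(term)
    return result


print(single_word_query("information retrieval"))
print(boolean_and_not(["information", "retrieval"]))           # {1, 2}
print(boolean_and_not(["information", "retrieval"], ["ICT"]))  # {1}
```

A single-word query needs no third step; the Boolean query is where the retrieved occurrence sets get intersected and subtracted.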
For a prefix query, how will it search? For example, we want to find all the keywords which start with "m", written as "m*". It will search the vocabulary and find the first keyword which starts with "m", or whatever is given as the prefix. From that word it goes on sequentially until it finds a word which does not start with "m", and then the occurrences associated with whichever words were retrieved from the vocabulary are returned.
If it is a range query, for example I want to find all the words between "get" and "take". How will it be searched? It will locate "get" in the vocabulary, then "take", and it will take all the keywords which come between them, because the vocabulary is stored alphabetically; that is why all the keywords in this range are retrieved. So here "letters", "made" and "many" will be retrieved from our example, with occurrences 60, 50 and 20. This is for the range query. So in a range query either both keywords are searched separately to get the bounds, or once you find the position of the first word, all the words up to the second word are taken.
What about a phrase query or a proximity query? A phrase query means we are searching for more than one word as a complete string. For example, "made from letters": we are searching for the complete "made from letters", or we are searching for "made up of letters". If it is a phrase query, we search for exactly "made up of letters"; but if it is a proximity query, and our query is "made up of letters" while our text is "made from letters", it will still be found. So how will it be searched? In "made from letters", "from" is a stop word, so "made" and "letters" are looked up separately; likewise "up" and "of" are stop words. So "made" and "letters" are taken, they are searched separately, and their occurrences are retrieved.
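The prefix and range searches described above both rely on the vocabulary being kept in alphabetical order. A minimal sketch using binary search from Python's standard `bisect` module follows; the function names are made up for illustration, and the vocabulary is the five-word one from the lecture example.

```python
from bisect import bisect_left, bisect_right

# The five-word vocabulary from the example, kept in alphabetical order.
VOCAB = ["letters", "made", "many", "text", "words"]


def prefix_query(vocab, prefix):
    """Return all vocabulary words that start with the given prefix."""
    # Binary search for the first word that is >= the prefix.
    start = bisect_left(vocab, prefix)
    matches = []
    # Scan sequentially until a word no longer starts with the prefix.
    for word in vocab[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches


def range_query(vocab, low, high):
    """Return all vocabulary words w with low <= w <= high."""
    # The bounds themselves do not have to be in the vocabulary.
    return vocab[bisect_left(vocab, low):bisect_right(vocab, high)]


print(prefix_query(VOCAB, "m"))          # ['made', 'many']
print(range_query(VOCAB, "get", "take"))  # ['letters', 'made', 'many']
```

The occurrences for each returned word are then fetched from the index exactly as for a single-word query.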
Now here we want the complete string to be found, so the third step is the manipulation of occurrences, that is, of whatever occurrences have been retrieved. If you look at the example, the keywords are occurring from 50 to 60. These lists are traversed in synchronization to find the places where the phrase actually appears. If you look from 50 to 60 in the text, what you get is "made from letters". So if it is a phrase query and this is the phrase, an exact match has been found: the sequential search starts from 50 and the result is retrieved. Now assume that our query is "made up of letters"; when we search sequentially, what we get is "made from letters". It is a nearby match, and that is why for a proximity query this will also be retrieved. So if it is a phrase or proximity query, we have to manipulate the occurrences here.
I hope you have understood how we search using an inverted index. So this is the reference, thank you.
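The synchronized traversal of occurrence lists described for phrase and proximity queries can be sketched as follows. This is a sketch under stated assumptions: it works on word positions rather than the character positions 50 and 60 mentioned in the lecture, it assumes "made" is word 12 and "letters" is word 14 of the sample text, and the function name and gap parameters are made up for illustration.

```python
def positional_matches(first, second, min_gap, max_gap):
    """Pairs (p, q) with p from first, q from second, min_gap <= q - p <= max_gap.

    Both lists must be sorted in ascending order; they are walked in step,
    so the lower bound in `second` never moves backwards.
    """
    matches = []
    start = 0
    for p in first:
        # Skip positions that are too far left for this p and every later p.
        while start < len(second) and second[start] < p + min_gap:
            start += 1
        j = start
        while j < len(second) and second[j] <= p + max_gap:
            matches.append((p, second[j]))
            j += 1
    return matches


# Assumed word positions of the two non-stop words in the sample text.
made, letters = [12], [14]

# Phrase query "made from letters": "letters" must be exactly 2 words after "made".
print(positional_matches(made, letters, 2, 2))  # [(12, 14)]

# Phrase query "made up of letters": an exact gap of 3 words finds nothing.
print(positional_matches(made, letters, 3, 3))  # []

# Proximity query "made up of letters": any gap from 1 to 3 words is accepted.
print(positional_matches(made, letters, 1, 3))  # [(12, 14)]
```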