Hello friends, today we are going to see how to construct a signature file and how to search using it. The learning outcome for this session is that students will be able to create the signature file indexing structure for a given text and search the text using that signature file.
So what is a signature file? It is a word-oriented index structure based on hashing. We use a hash function to get a signature: the hash function maps each word to a bit mask of B bits. What are the steps? First, divide the text into blocks of b words each. Then compute the signature of every word present in a block and OR together the signatures of all the words in that block. The result is the signature, or bit mask, of that block. Find the signature for every block and write them in sequence; that sequence is called the signature file.
There is one phenomenon called a false drop. It can happen that all the bits set in a word's signature are also set in the mask of a block even though the word is not present in that block. This is called a false drop, and we will see it with an example. The challenge in selecting the hash function is that we have to reduce the probability of a false drop while keeping the signature file as short as possible.
We choose the signature hash function in such a manner that at least l bits are set in every signature, at random positions. So how many bits l should we select? Let alpha = l / B, where l is the number of bits set and B is the total number of bits in the mask. Since each of the b words in a block sets l bits at random, the probability that a given bit of the block mask is set is 1 - (1 - 1/B)^(b l), which is approximately 1 - e^(-b alpha).
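The approximation in that last step can be spelled out. This is a sketch of the standard argument, assuming B is large enough that (1 - 1/B)^B is close to 1/e. Here B is the number of bits in the mask, b the number of words per block, l the number of bits set per word, and alpha = l / B.

```latex
\[
\left(1 - \frac{1}{B}\right)^{bl}
  = \left[\left(1 - \frac{1}{B}\right)^{B}\right]^{bl/B}
  \approx \left(e^{-1}\right)^{bl/B}
  = e^{-b\alpha},
\qquad
\text{so}\quad
1 - \left(1 - \frac{1}{B}\right)^{bl} \approx 1 - e^{-b\alpha}.
\]
```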
So the probability that the l random bits set in the query are also set in the mask of a text block is (1 - e^(-b alpha))^(alpha B), which is minimized for alpha = ln(2) / b. At that value e^(-b alpha) = 1/2, so the false drop probability under the optimal selection of l, which is l = B ln(2) / b, is (1/2)^(alpha B) = 1 / 2^l. So we have to select an appropriate ratio of the number of bits in the mask to the number of words in a block, so that false drops are reduced.
Now let us look at the example. First we divide the text into blocks of b words each. We have taken the same example that we saw for the inverted index. Next we apply the hash function to find the signature, which maps each word to a bit mask of B bits. Here we have applied one hash function and obtained the hash values, that is, the signatures. If we look at these signatures, two bits are set to 1 in each, so in our case l = 2, whereas the number of bits B is 6.
In the third step we OR the signatures of all the words occurring in each text block. In the first block there is only a single word, text, so its hash value is the bit mask for that block. In the second block we have two words, text and many, so we OR these two signatures and the result is the signature of block 2. In the third block, words occurs twice, but it is a single word, so ORing its signature with itself gives the same value, and that is the bit mask for the third block. In the fourth block we have two keywords, made and letters, and the OR of their two signatures is the bit mask.
That is how we find the bit mask for all the blocks. Each bit mask followed by a pointer to the actual text of its block is called a text signature.
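The construction steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the lecturer's implementation: the signatures for many, words, made and letters are the values read out in the talk, while the signature for text and the exact wording of the four blocks are assumed here because they appear only on the slide. The names `SIGNATURES`, `BLOCKS`, `block_mask` and `build_signature_file` are illustrative.

```python
# Word signatures with B = 6 bits and l = 2 bits set.
SIGNATURES = {
    "text":    0b000101,  # assumed value, not read out in the talk
    "many":    0b110000,
    "words":   0b100100,
    "made":    0b001100,
    "letters": 0b100001,
}

# Assumed block contents; words without a signature act as stop words.
BLOCKS = [
    "this is a text",
    "a text has many",
    "words words are",
    "made from letters",
]


def block_mask(block: str) -> int:
    """OR together the signatures of all indexed words in one block."""
    mask = 0
    for word in block.split():
        mask |= SIGNATURES.get(word, 0)  # stop words contribute no bits
    return mask


def build_signature_file(blocks: list[str]) -> list[tuple[int, int]]:
    """Return (bit mask, pointer) pairs; the pointer is the block index."""
    return [(block_mask(block), i) for i, block in enumerate(blocks)]


signature_file = build_signature_file(BLOCKS)
for mask, pointer in signature_file:
    print(f"block {pointer + 1}: {mask:06b}")
```

With these assumed inputs, block 4 gets the mask 101101, the OR of 001100 (made) and 100001 (letters), and block 3 keeps 100100 because ORing a signature with itself changes nothing.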
So the collection of all these bit masks and pointers is our signature file. I hope you have understood how to create a signature file.
Once we have created the signature file, we have to see how to search it. A query can be a single word or a phrase; first we will look at the single word. For a single word, find the bit mask of that word, which we call W. Then perform the AND operation of W with the bit mask B_i of every block. If W AND B_i is equal to W, it means that all the bits set in W are also set in B_i, and there is a possibility of finding the word in that block. The block only may contain the word, because we have already seen the phenomenon of the false drop. So for all the text blocks that satisfy this condition, we do an actual search of the text to check whether the word is present or not.
Take this example: our query is many. First find the bit mask for the query word, which is 110000. In the next step we AND this mask with the bit mask of every block and look at the result; on the slide this is written as B_i AND W. Only the second block gives back the same bit mask as the query, which means that our word may be present in the second block. So we do the actual sequential search in the second block and check whether the word is really present. In our example, many is present in the second block.
For the next example we have taken the query words. Find out in which blocks this word may occur; pause the video and try to work out the result. It is the same process again: 100100 is the bit mask for this query, and we AND it with the bit mask of every block, B1, B2, B3 and B4.
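The single-word search steps can be sketched as follows. This is a minimal sketch, assuming the four block masks of the running example and an assumed wording for the block texts (the talk shows them only on the slide). The names `candidate_blocks` and `search_word` are illustrative.

```python
# Block masks and texts of the running example (block texts are assumed).
BLOCK_MASKS = [0b000101, 0b110101, 0b100100, 0b101101]
BLOCK_TEXTS = ["this is a text", "a text has many",
               "words words are", "made from letters"]


def candidate_blocks(query_mask: int) -> list[int]:
    """Indices of blocks where every query bit is set (W AND B_i == W)."""
    return [i for i, mask in enumerate(BLOCK_MASKS)
            if mask & query_mask == query_mask]


def search_word(word: str, query_mask: int) -> tuple[list[int], list[int]]:
    """Return (true matches, false drops) as 1-based block numbers."""
    hits, false_drops = [], []
    for i in candidate_blocks(query_mask):
        # Sequential scan of the candidate block to confirm the word.
        if word in BLOCK_TEXTS[i].split():
            hits.append(i + 1)
        else:
            false_drops.append(i + 1)
    return hits, false_drops


print(search_word("many", 0b110000))  # ([2], [])
```

Calling `search_word("words", 0b100100)` in the same way is one way to check the exercise for the query words.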
What we can see is that three blocks, B2, B3 and B4, are qualifying blocks, because the result of the AND is the same as the query mask. So we do the actual online search, the sequential search, to find whether each of these blocks really contains the word words. But when we look at the actual text, we find that block 2 and block 4 are false drops. Though the bits set in the query are also set in the block mask, the word words is not present in the second block or the fourth block. So I hope you have now understood what we mean by a false drop: the bits are set but the word is not present. This is how a single-word query is searched.
Now, how do we search phrase queries and proximity queries? This indexing structure is efficient for phrase and proximity queries. What we have to do is perform the bitwise OR of the signatures of all the words in the query, and the remaining search steps are the same. Let us look at the example again. The query is made from letters. The words are made and letters; from is a stop word. The bit mask for made is 001100, whereas for letters it is 100001. OR the signatures of all the words present in the query to get the bit mask for the query, so 101101 is the bit mask for the query. Again perform the AND operation with every block's bit mask and see which blocks qualify. In this scenario we find that the last block is the qualifying block, since the AND gives back the same bit mask as the query, and then we go on to search it. When we search block 4 sequentially, we find that made from letters is actually present in block 4 as a phrase.
If instead of made from letters the query is made up of letters, then the actual search will not match, because made from is there in the text but we are interested in made up, and we are searching for the complete phrase.
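The phrase search can be sketched in the same way. This is a minimal sketch under assumptions: the signature for text and the wording of the blocks are assumed, and every word without a signature (from, up, of) is treated as a stop word, which the talk states only for from. The names `query_mask` and `search_phrase` are illustrative.

```python
SIGNATURES = {"text": 0b000101, "many": 0b110000, "words": 0b100100,
              "made": 0b001100, "letters": 0b100001}  # "text" value assumed
BLOCK_MASKS = [0b000101, 0b110101, 0b100100, 0b101101]
BLOCK_TEXTS = ["this is a text", "a text has many",
               "words words are", "made from letters"]  # assumed block texts


def query_mask(phrase: str) -> int:
    """OR the signatures of the query words; stop words have no signature."""
    mask = 0
    for word in phrase.split():
        mask |= SIGNATURES.get(word, 0)
    return mask


def search_phrase(phrase: str) -> tuple[list[int], list[int]]:
    """Return (blocks containing the phrase, false drops), 1-based."""
    w = query_mask(phrase)
    hits, false_drops = [], []
    for i, mask in enumerate(BLOCK_MASKS):
        if mask & w != w:
            continue  # a query bit is missing, so the block cannot match
        # Candidate block: confirm with a sequential scan for the exact phrase.
        if f" {phrase} " in f" {BLOCK_TEXTS[i]} ":
            hits.append(i + 1)
        else:
            false_drops.append(i + 1)
    return hits, false_drops


print(f"{query_mask('made from letters'):06b}")  # 101101
print(search_phrase("made from letters"))        # ([4], [])
print(search_phrase("made up of letters"))       # ([], [4])
```

Both queries produce the same mask 101101 and therefore the same candidate, block 4; only the sequential scan tells them apart.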
For the query made up of letters, the result is then considered a false drop, because some of the words are present but the complete phrase is not. But if it is a proximity query, then made from letters and made up of letters are near matches, and in that scenario the result will be returned. So this indexing structure helps a lot with proximity queries, whereas with an inverted index we require some more operations to be performed for phrase and proximity queries. This is how we build a signature file and search using it.
There is only one more thing. Suppose the query is many words as a phrase. If we look at our text, many is in one block and words is in the next block. So we have to take care of this when we search for a phrase. How many words are in the phrase? Say j words, here 2 words. When we search a block, consecutive blocks have to be searched with an overlap of j words, so that if the phrase has been broken across two blocks we can still find it easily. When we look for many words, it may be that block 2 and block 3 qualify. Then, during the sequential search of block 2, if we find many, we should go on to the next 2 words, which are in block 3, to find out whether the text really contains many words or not. This care has to be taken for phrase queries. Thank you.
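The block-boundary check described for phrase queries can be sketched as follows. This is a minimal sketch, assuming the same block wording as in the running example; `phrase_in_block_with_overlap` is an illustrative name, and the overlap of j words follows the talk.

```python
BLOCK_TEXTS = ["this is a text", "a text has many",
               "words words are", "made from letters"]  # assumed block texts


def phrase_in_block_with_overlap(phrase: str, block_index: int) -> bool:
    """Scan one candidate block plus the first j words of the next block."""
    target = phrase.split()
    j = len(target)
    words = BLOCK_TEXTS[block_index].split()
    if block_index + 1 < len(BLOCK_TEXTS):
        # Extend the scan by j words so a phrase split across blocks is seen.
        words += BLOCK_TEXTS[block_index + 1].split()[:j]
    return any(words[k:k + j] == target for k in range(len(words) - j + 1))


# Block 2 alone ends with "many"; the phrase continues in block 3.
print("many words" in BLOCK_TEXTS[1])                 # False
print(phrase_in_block_with_overlap("many words", 1))  # True
```

Scanning block 2 on its own misses the phrase, while scanning it together with the first two words of block 3 finds it.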