Hello everyone. In the last lecture we saw the classic information retrieval model, the Boolean model. Today we are going to learn the vector model. The learning outcome for this session is that students will be able to create a vector model and retrieve the documents for a given query using it.

In the Boolean model we saw that there is no notion of partial matching: since the weights are binary, a document comes out either relevant or non-relevant, so the model either returns too many documents or demands an exact match. How is this improved in the vector model? The weights are taken to be non-binary, positive values. For the pair (k_i, d_j), the weight w_ij is the weight of keyword k_i within document d_j, and all the weights we define are non-binary values between 0 and 1. We also assign weights to the index terms in the query, written w_iq, the weight of keyword k_i with respect to query q, which again lies between 0 and 1.

So what will the query vector be? It is the weight of every term, from the first up to the t-th, with respect to the query: q = (w_1q, w_2q, ..., w_tq). The vector for document d_j is d_j = (w_1j, w_2j, ..., w_tj), the weights of all t terms with respect to document j. In the Boolean model these weights were either 1 or 0; here, since they are non-binary, our task is to find the weight vectors based on term frequencies, that is, occurrences.

What we have obtained now are two vectors in a t-dimensional space: the document vector d_j and the user query vector q. So what is the degree of similarity, and which documents will be retrieved? The vector model proposes to evaluate the degree of similarity of document d_j and query q as the correlation between these two vectors, given by the cosine of the angle theta between them:

sim(d_j, q) = (d_j . q) / (|d_j| * |q|) = (sum_i w_ij * w_iq) / (sqrt(sum_i w_ij^2) * sqrt(sum_i w_iq^2))

Simplifying the dot product expresses the formula in terms of the weight vectors of the document and the query that we have created. The smaller the angle, the more similar the document; as the angle increases, the document becomes less similar, because we are doing approximate matching here, and if the query vector q and the document vector d_j coincide, they are exactly similar. This is how we calculate the degree of similarity, and you can consider this the ranking function of the vector model.

So let us see how to create the weight vectors and then find the degree of similarity for a given query. Since the weights of both vectors vary from 0 to 1, the degree of similarity also lies between 0 and 1, and that is why we say the vector model ranks the documents according to the degree of similarity. Some documents may get a similarity of 0.9, 0.8, 0.5 and so on; the higher the value, the more relevant the document. That is why we can say the vector model partially matches the documents, and of course we are also going to define a threshold.
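To make this ranking function concrete, here is a minimal Python sketch of the cosine measure; the weight vectors at the bottom are made-up values, not from the lecture, just to show that a smaller angle gives a similarity closer to 1.

```python
import math

def cosine_similarity(doc_w, query_w):
    """Degree of similarity: sim(d_j, q) = (d_j . q) / (|d_j| * |q|)."""
    dot = sum(wd * wq for wd, wq in zip(doc_w, query_w))
    norm_d = math.sqrt(sum(w * w for w in doc_w))
    norm_q = math.sqrt(sum(w * w for w in query_w))
    if norm_d == 0 or norm_q == 0:   # an all-zero vector matches nothing
        return 0.0
    return dot / (norm_d * norm_q)

# Made-up non-binary weight vectors over t = 4 index terms.
print(cosine_similarity([0.8, 0.3, 0.0, 0.5], [0.7, 0.4, 0.0, 0.6]))  # near 1: similar
print(cosine_similarity([0.8, 0.3, 0.0, 0.5], [0.0, 0.0, 0.9, 0.0]))  # 0.0: no shared terms
```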
So, even though the degrees of similarity are values like 0.1, 0.21 and so on, which documents should be taken into the result and which should not depends on the threshold that has been set.

Now, in this vector model, while defining the weights we are going to use the concept of clustering. What is that concept? We are given a collection of objects; look at this basket as a collection of objects, say vegetables. A vague description of the object we want to search for is given, and our goal is a simple clustering algorithm that separates the collection into two sets: one set composed of the objects related to set A, and a second composed of the objects not related to set A. For example, suppose I want to find the vegetables that are comparable to a carrot, say length-wise. Spinach and other leafy vegetables will not come into that category, and the onion will be separated into the second set, but a cucumber or a bottle gourd will come into the category of the carrot. So what we are doing is finding some features in which an object is similar to the carrot and some in which it is not, and this leads to two notions: intra-cluster similarity and inter-cluster dissimilarity.

Now, how can the IR or retrieval problem be considered a clustering problem? The given collection of objects is nothing but our documents, the text collection, and the vague description of the object is nothing but the query you want to search. So the IR problem can be reduced to finding the documents that are in set A, meaning those similar to the user query, and those that are not in set A.

As I was discussing, for intra-cluster similarity one needs to determine which features better describe the objects in set A. Looking at the carrot example, you can say the length should be so much and the shape elongated; or, if you go by colour, the colour should be reddish, and so on. If you describe only the colour and not the length, maybe a tomato or a red onion will also come in because it is reddish, but if you describe it based on the length, it will not. So you have to decide on features that exactly describe that object; this is what is called intra-cluster similarity: all the objects in set A will have the same features or properties. Whereas all the objects not present in A will not have the same properties; this is called inter-cluster dissimilarity. Two clusters are there: within one cluster everything is similar, but if you check the similarity between one cluster and the other it is different, and that is inter-cluster dissimilarity.
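As a toy illustration of this two-set split (not from the lecture slides), the sketch below separates a small collection using a single made-up feature, length in cm, and an arbitrary tolerance: the objects inside set A share the feature (intra-cluster similarity), while the rest differ from them (inter-cluster dissimilarity).

```python
# Hypothetical lengths in cm for each object in the collection.
vegetables = {"carrot": 18, "cucumber": 20, "bottle gourd": 22,
              "onion": 7, "spinach": 10}
target, tolerance = vegetables["carrot"], 5   # vague description: "like a carrot, length-wise"

set_a = {name for name, length in vegetables.items()
         if abs(length - target) <= tolerance}   # intra-cluster similarity
others = set(vegetables) - set_a                 # inter-cluster dissimilarity

print("comparable to carrot:", sorted(set_a))
print("not comparable:", sorted(others))
```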
Now, how do we use this in IR? Intra-cluster similarity is quantified by measuring the raw frequency of a term inside the document d_j; this is called the term frequency (tf) factor. It indicates how well that term describes the document: if an index term describes a document, it is going to occur in it more often. This is how we define the raw frequency. Inter-cluster dissimilarity, in turn, is measured by the inverse of the frequency of the term among the documents in the collection: in how many documents of the collection the term is present, taken inversely. For example, if out of 10 documents my keyword occurs in 7, it is absent from 3; this inverse notion is what captures inter-cluster dissimilarity.

So how do we calculate the weight vectors? Let N be the total number of documents in the system and n_i the number of documents in which index term k_i appears, and let freq_ij be the raw frequency of term k_i in document d_j. The normalized frequency scales it between 0 and 1: find the maximum raw frequency in the document and divide all the raw frequencies by it,

f_ij = freq_ij / max_l freq_lj

The inverse document frequency, as we have just discussed, uses n_i:

idf_i = log(N / n_i)

Once you have identified the term frequency and the inverse document frequency, there are multiple weighting schemes; one of them is the tf-idf scheme, where the normalized frequency multiplied by the idf gives the weight:

w_ij = f_ij * idf_i

Once we have created the weight vectors for the documents, we can create them in the same manner for the queries and then find the degree of similarity. For the query the formula is

w_iq = (0.5 + 0.5 * freq_iq / max_l freq_lq) * log(N / n_i)

The 0.5 term is introduced because it may happen that a query contains only a few words, each occurring just once; this smoothing is how we create the weight vectors for queries.

Now let us look at the example. We have taken three text documents, and n_i is, of course, the number of documents in which each keyword is present. First, see which index terms we have taken: six index terms have been identified from this collection (for finding the index terms you can again refer to the video on the creation of logical documents). Then the raw frequency is how many times a keyword occurs in the first document: 'mount' occurs twice, so its raw frequency is 2, 'Everest' occurs as well, 'mountain' occurs once, and the terms that are not present get 0; out of these the maximum is 2. In the same way we find the raw frequencies for the keywords in the second document: 'mountain' and 'Kalsubai' are the two keywords present, and the maximum is again 2. In the third document 'mount', 'mountain' and 'Fuji' occur, so the maximum is 1. This is how the raw frequency and the maximum are found; for the normalized frequency, divide each raw frequency by the maximum, and what we get is the normalized frequency f_ij.
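As a minimal sketch of the weighting scheme just described, assuming base-10 logarithms (which reproduce values like log(3/2) = 0.176 used in the next step), the Python below computes document weights and the smoothed query weights. The raw counts are hypothetical placeholders over a six-term vocabulary, not the lecture's actual collection.

```python
import math

def normalized_tf(freqs):
    """f_ij = freq_ij / max_l freq_lj: scale raw counts into [0, 1]."""
    max_f = max(freqs)
    return [f / max_f for f in freqs]

def doc_weights(freqs, idf):
    """Document weighting: w_ij = f_ij * idf_i."""
    return [f * i for f, i in zip(normalized_tf(freqs), idf)]

def query_weights(freqs, idf):
    """Query weighting with 0.5 smoothing; absent terms get weight 0."""
    max_f = max(freqs)
    return [(0.5 + 0.5 * f / max_f) * i if f > 0 else 0.0
            for f, i in zip(freqs, idf)]

# Hypothetical raw counts for 3 documents over a 6-term vocabulary.
docs = [[2, 2, 1, 0, 0, 0],
        [0, 0, 2, 2, 0, 0],
        [1, 0, 1, 0, 1, 1]]
N = len(docs)
n_i = [sum(1 for d in docs if d[i] > 0) for i in range(6)]  # document frequency
idf = [math.log10(N / n) for n in n_i]                      # idf_i = log(N / n_i)

print([round(v, 3) for v in idf])                     # e.g. log10(3/2) -> 0.176
print([round(w, 3) for w in doc_weights(docs[0], idf)])
```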
After that we identify n_i, the number of documents in which each keyword is present, and then we go for the inverse document frequency. How do we find it? It is log(N / n_i); in our case N is 3, so for a keyword present in 2 documents it is log(3/2) = log(1.5), which is 0.176, and the other inverse document frequencies are calculated in the same manner. Once we have identified idf_i and f_ij, let us calculate the weight vectors; at this moment you can pause the video and try to find them yourself. The weight is the multiplication f_ij * idf_i, and we obtain three weight vectors, W for d1, d2 and d3: for instance, 0.176 multiplied by the corresponding normalized frequency of 1, and so on. This is how we get the weight vectors.

In the same manner we calculate the raw frequency for the query. Assume our query is 'mount Kalsubai': 'mount' and 'Kalsubai' each occur only once, and the remaining terms are 0. (If a term occurs more than once, you record 2, 3, as many times as it occurs.) The maximum is 1, which is why the raw frequency and the normalized frequency are the same. The idf factor is the same as we calculated for the documents, and then we can go for the calculation of the weight vector for the query.

Once we have calculated the weight vectors for the documents and the query, we can calculate the degree of similarity. If you remember the formula, the numerator involves the weights of the keywords present in the query: the weight of 'mount' in the document times its weight in the query, whereas the second keyword, 'Kalsubai', is not present in the first document, so it contributes 0. The denominator is the square root of the sum of squares of all the keyword weights in the document, times the same quantity for the query; the query part is common everywhere, only the document part changes. After the calculation we get degrees of similarity of 0.14, 0.91 and 0.16, so the most relevant document, with 0.91, is d2, and the ranking will be d2, d3 and then d1. Here we have not considered a threshold, but this is how we can rank the documents; the degree of similarity has to be calculated for every document.

So what are the advantages? The retrieval performance improves because a partial-matching strategy is allowed: documents that only approximate the query conditions can still be retrieved, and the cosine formula gives you a sorting of the documents according to their degree of similarity. What is the disadvantage? The index terms are assumed to be mutually independent; we have not considered any dependency between the keywords. So this is how the vector model works. Thank you.
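Putting the pieces together, here is a hedged end-to-end sketch that mirrors the flow of the worked example: compute idf over the collection, weight the documents and a two-term query (with the 0.5 smoothing), and rank by cosine similarity. Since the counts are the same hypothetical placeholders as before, the printed scores will differ from the lecture's 0.14, 0.91 and 0.16, but the steps are identical.

```python
import math

def cosine(a, b):
    """Degree of similarity: sim(d_j, q) = (d_j . q) / (|d_j| * |q|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical raw counts over a six-term vocabulary; the query contains
# two of the terms, once each (like 'mount Kalsubai' in the lecture).
docs = {"d1": [2, 2, 1, 0, 0, 0],
        "d2": [0, 0, 2, 2, 0, 0],
        "d3": [1, 0, 1, 0, 1, 1]}
q_freqs = [1, 0, 0, 1, 0, 0]

N = len(docs)
idf = [math.log10(N / sum(1 for f in docs.values() if f[i] > 0))
       for i in range(6)]

def doc_w(freqs):                      # w_ij = (freq_ij / max freq) * idf_i
    m = max(freqs)
    return [(f / m) * i for f, i in zip(freqs, idf)]

def query_w(freqs):                    # 0.5-smoothed weights; absent terms get 0
    m = max(freqs)
    return [(0.5 + 0.5 * f / m) * i if f > 0 else 0.0
            for f, i in zip(freqs, idf)]

wq = query_w(q_freqs)
scores = {name: cosine(doc_w(f), wq) for name, f in docs.items()}
for name in sorted(scores, key=scores.get, reverse=True):   # rank documents
    print(name, round(scores[name], 2))
```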