Hello friends, today we are going to see the GEMINI approach in multimedia information retrieval. The learning outcome for this session is that students will be able to retrieve multimedia objects using the GEMINI approach, and to design fast searching methods. So what is the goal? The goal is to design a fast searching method that will search a database of multimedia objects and locate the objects that match a query object. What can those data objects be? They can be 2D or 3D colour or grayscale images, music, voice or video clips, financial or marketing time-series data, DNA data, and so on. The patterns we find can then be used for prediction and other purposes. What can the query examples be? Find the photographs with the same colour distribution; find the companies with similar stock movement; find the brain scans containing a tumour; or find the images that lie within a specified distance of a given image.

Now, there are two types of pattern matching: whole match and sub-pattern match. In whole match, the size of the query object and the size of the objects stored in the collection are the same. For example, the stored objects are 512 × 512 images and the query object is also of the same size. In sub-pattern match, we are looking for some part of an object; here the example is that the brain scans are 512 × 512 and the tumour is of size 16 × 16, so this is considered an example of sub-pattern match. For fast retrieval we use an indexing method, also called a spatial access method (SAM). Among these methods we are going to use the R-tree. We will not explore the R-tree in detail, but briefly: it stores minimum bounding rectangles based on the positions of the points. From here on we consider the whole-match case: we have a collection of n objects.
Now, how will we find which objects are similar and which are dissimilar? For that we define a distance function: the distance, or dissimilarity, between two objects is denoted d(Oi, Oj). The user specifies the query object and also a tolerance epsilon, meaning how much distance between two objects is allowed. So the goal is: find the objects in the collection within distance epsilon of the query object.

How can we do this with sequential searching? In sequential searching, the distance between the query object and every stored object is computed, and if the distance is less than or equal to epsilon, that object is added to the result. But this will be slow for two reasons. First, each distance computation is expensive. Second, as the size of the database grows, the total computation time grows with it. That is why we should find another approach, based on indexing.

That other approach is the GEMINI approach, where the long form is GEneric Multimedia object INdexIng. It is an indexing approach with two steps. First, a quick-and-dirty test to quickly discard the majority of the non-qualifying objects; there is a possibility of false alarms here, and we will see shortly what a false alarm means. Second, use a spatial access method to achieve faster retrieval than sequential search.

So what is this quick-and-dirty test? Instead of using the complete data, if we can represent the data by one or a few numbers, that will help to discard the non-qualifying sequences or objects. Consider the example of stock prices: assume we have one year of data, which means at least 253 values (one per trading day).
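The sequential-scan baseline described above can be sketched as follows (the object IDs and the use of plain Python lists for the series are my own illustrative assumptions):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sequential_search(collection, query, eps):
    """Compare the query against every stored object; keep those within eps.

    Every comparison touches the full object, so the cost grows with both
    the object length and the number of objects -- the motivation for GEMINI.
    """
    return [oid for oid, obj in collection.items()
            if euclidean(obj, query) <= eps]
```

This is the method GEMINI is designed to beat: correct, but linear in both the database size and the object length.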
So instead of taking all 253 values into the calculation, if we represent them by a few numbers, it will be easy to discard the non-qualifying objects. For example, suppose we take the average of all the stock values over the year: instead of 253 values we have only one. Now, if the averages of two companies differ by a lot, then obviously the underlying series differ by a lot as well. But the converse is not true, and so we may have false alarms. What do we mean by false alarms? Although the averages are close to each other, it may be that the actual values differ considerably; such an object passes the quick test even though it is not actually similar, and this is what comes under false alarms.

So what the quick-and-dirty test involves is this: represent the data by a few numbers, and each such number, which carries information about the data, is called a feature. Several good features can be selected; of course, the number of features should be much smaller than the amount of available data. That is the quick-and-dirty test.

Now the second thing: while selecting features, they should be good features. Which features are allowed? Those that satisfy the lower-bounding lemma: the distance between the feature vectors should be less than or equal to the distance between the actual objects. If a feature satisfies this, we can take it into consideration; otherwise we have to go for other features.

So what is the GEMINI approach, actually? The first step is to decide the distance function we are going to use. This distance function will be chosen by a domain expert: whether it should be the Euclidean distance, the Manhattan distance, or some other distance depends on the data, so the domain expert can decide.
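The false-alarm situation with the yearly-average feature can be made concrete with a small sketch (the two series below are invented for illustration):

```python
import math

def avg(series):
    """Single feature: the average of the whole series."""
    return sum(series) / len(series)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

q = [10, 20, 30, 40]  # hypothetical query series
s = [40, 30, 20, 10]  # the reverse: same average, very different values

# The feature test passes (the averages are identical) ...
assert abs(avg(q) - avg(s)) == 0.0
# ... yet the actual distance is large: a false alarm, which the
# second, exact distance computation of GEMINI will discard.
assert euclidean(q, s) > 40
```

Note the asymmetry: a large feature distance safely discards an object, but a small feature distance proves nothing by itself.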
The second step is to find numerical features, using an extraction function, to provide the quick-and-dirty test. The third is to prove that the distance in feature space lower-bounds the actual distance, to guarantee correctness. Then we use a spatial access method to store and retrieve the f-dimensional feature vectors. And finally, for the objects that qualify, we compute the actual distance. These are the steps.

We will see this with one example. Consider that we have time-series data for a number of companies, and we want to find the one most similar to a given query. For example, company Q's data is given, and out of 10 companies we need to find which is similar to Q. What are the steps? If we apply sequential searching, using the Euclidean distance and assuming 253 values per series, we need to compute 10 distances: the query with S1, the query with S2, and so on, and every distance computation takes all 253 values. So as n increases, the computation time increases.

What would the GEMINI approach be? The first step is to decide the distance function; as we have just seen, it is the Euclidean distance that we are going to use for this example. The second is to find one or more numerical features, using an extraction function, to provide the quick-and-dirty test. What can those features be? One can take the yearly average, which gives a single feature; the half-yearly averages, which give two features; the quarterly averages, which give four features; and so on. Or one can use the discrete Fourier transform, the discrete cosine transform, or the wavelet transform and keep some coefficients, and of course the number of coefficients kept will be smaller than the original data. Any of these can be used as features. Further, we have to prove that the distance in feature space is less than or equal to the actual distance.
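The candidate feature-extraction functions just listed (yearly, half-yearly, or quarterly averages, or a few transform coefficients) might be sketched like this; the function names are my own:

```python
import numpy as np

def segment_averages(series, k):
    """Split the series into k roughly equal segments and average each.

    k = 1 gives the yearly average (one feature), k = 2 the half-yearly
    averages (two features), k = 4 the quarterly averages, and so on.
    """
    chunks = np.array_split(np.asarray(series, dtype=float), k)
    return [float(c.mean()) for c in chunks]

def dft_features(series, f):
    """Keep only the first f DFT coefficients as the feature vector."""
    return np.fft.fft(series, norm="ortho")[:f]
```

Either extractor maps a long series to a short feature vector; the next step is to check that the chosen extractor satisfies the lower-bounding lemma.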
If we are using the discrete Fourier transform, Parseval's theorem helps here: it says that the energy of a signal, and hence the Euclidean distance between two signals, is preserved by the transform, so the distance between the full coefficient vectors equals the distance between the signals. Therefore, if only the first f coefficients of the transform are used, the feature distance can only shrink, and it lower-bounds the actual distance.

The next step is to map the actual objects into the f-dimensional feature space. What do we mean by this? F is a mapping into f-dimensional points, and F(O) is the point corresponding to object O. Consider one series for which we want to use a single feature: we can take the average, and that is the value. So all the values are converted into a single number, a one-dimensional point. Or, if we are using two features, two values are generated: the first is the average of the first half and the second is the average of the second half. So the series has been converted into a two-dimensional feature space.

How do we map it? Look at the example. Here we have considered 10 companies, this is the query object, and we have taken 14 values per series. At this moment, pause the video and try to find the two features as we discussed earlier: one value is the average of the first half and the second is the average of the second half. Every series of 14 values is converted into two values; those two features are nothing but F1 and F2. So we map the actual objects into the f-dimensional feature space: the first series, the second series, the third and so on, with the values considered here, have all been mapped into a two-dimensional feature space.
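Parseval's theorem and the resulting lower bound can be verified numerically; the random series below (of length 253, matching the trading-days example) are my own synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=253)   # synthetic "query" series
s = rng.normal(size=253)   # synthetic stored series

actual = np.linalg.norm(q - s)

# The orthonormal DFT preserves Euclidean distance (Parseval's theorem),
# so the distance over ALL coefficients equals the actual distance ...
Q = np.fft.fft(q, norm="ortho")
S = np.fft.fft(s, norm="ortho")
assert np.isclose(np.linalg.norm(Q - S), actual)

# ... and keeping only the first f coefficients can only shrink it,
# which is exactly the lower-bounding property GEMINI requires.
f = 3
assert np.linalg.norm(Q[:f] - S[:f]) <= actual
```

The same argument applies to any orthonormal transform, which is why DCT or wavelet coefficients can serve as features as well.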
So from the 14 values of each series we obtained two values, which are plotted here, that is, treated as a point in two-dimensional space. In the same manner, all 10 companies are plotted here as 10 points. (This is not exactly the same data we took before; it is sample data.) These points will be stored in an R-tree; as we discussed, a spatial access method is used for storage and retrieval. There are 10 companies, so there are 10 points, and we took two features, which is why the space is two-dimensional. With three features it would be three-dimensional, and with f features we would be in an f-dimensional space.

Now, since we store the points in an R-tree, they are grouped into minimum bounding rectangles (MBRs). Here the root has two child nodes, one rectangle and a second rectangle, which in turn again contain rectangles; how many levels there are depends on the fan-out that has been set, but the R-tree itself is not the topic of our discussion right now.

Look at the red point: it is the query point. We want to find all the points that are near this query point, and that is why the circle gives the boundary within which we should look for the other companies. Why is this so? Because epsilon, the tolerance, is the radius of the circle: we want to find the companies which are similar to, that is, close to, the query company. So how do we start? This is the query MBR: the rectangle that just fits the circle is taken as the MBR for the query. Start with the root node. The root has these two child nodes, and we have to find the node whose rectangle intersects the query MBR. One of them intersects, so we discard the other part and go further. Now this rectangle again contains two rectangles.
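The pruning step of the R-tree traversal rests on a simple axis-aligned rectangle overlap test. A minimal 2-D sketch (the `(xmin, ymin, xmax, ymax)` layout is my own convention):

```python
def intersects(r1, r2):
    """True if two axis-aligned rectangles (xmin, ymin, xmax, ymax) overlap."""
    return (r1[0] <= r2[2] and r2[0] <= r1[2] and
            r1[1] <= r2[3] and r2[1] <= r1[3])

def query_mbr(qx, qy, eps):
    """MBR of the circle of radius eps around the query point (qx, qy)."""
    return (qx - eps, qy - eps, qx + eps, qy + eps)
```

A subtree whose MBR fails `intersects` against `query_mbr(...)` can be discarded without visiting any of its points, which is what makes the traversal fast.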
We again have to find which one intersects. One rectangle intersects only slightly, with no point involved, while the other properly intersects the query MBR, and that is where we go. Now we have reached a leaf node: there is no further rectangle inside it, so we take all the points residing in this MBR, and these are called the qualifying objects. Here there are two such objects, S3 and S6; these are the qualified candidates. So what have we done? Out of 10 objects, only two remain close enough to be taken into consideration, which makes the process much faster. For these candidate objects the actual distance is then computed, and if it is indeed less than or equal to epsilon, the object is added to the result. This is what the GEMINI approach is.

Note that fewer features may lead to more false alarms, and time will be lost on them; on the other hand, adding more features means more complex computations. So one has to identify the optimal number of features, and generally 1 to 3 features give the best results, so 1 to 3 is a good number of features for optimal performance. Thank you.
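Putting the two stages together, here is a minimal end-to-end sketch of the GEMINI range query, assuming the caller supplies a feature extractor that satisfies the lower-bounding lemma (names and the dictionary-based collection are illustrative; a real system would replace the linear feature-space scan with an R-tree):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def gemini_range_query(collection, query, eps, feature):
    """Two-stage GEMINI search.

    Stage 1 (quick-and-dirty): filter in feature space. Because the
    feature distance lower-bounds the actual distance, no qualifying
    object is lost here -- only false alarms may slip through.
    Stage 2: compute the actual distance for the few candidates,
    discarding the false alarms.
    """
    fq = feature(query)
    candidates = [oid for oid, obj in collection.items()
                  if euclidean(feature(obj), fq) <= eps]
    return [oid for oid in candidates
            if euclidean(collection[oid], query) <= eps]
```

For instance, with `feature = lambda x: x[:1]` (keep only the first value), the feature distance trivially lower-bounds the full Euclidean distance, so the query is correct even though the filter is weak.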