Good morning. My name is Francesco Nazzaro. Today I'm going to talk about image recognition and camera positioning with OpenCV, in a tourist guide application. I work for Be Open Solutions in Rome; we develop software solutions for managing and publishing geospatial data using open source software based on Linux and Python.

Image recognition is a field in evolution, and big companies invest a lot of resources in this research field. The main issues in image recognition are the following. First of all, we have to reproduce a human ability, and this is a very big challenge. The images to compare can be distorted and oriented in different ways, so we have to detect several features, and the detection algorithm has to be scale-invariant.

For our scope, we might be tempted to use a ready-to-use image recognition tool like Google Images. Let's see why we don't use it. This is a typical image taken with a smartphone or Google Glass, and we have to recognize the picture in the center, not the arm of my boss or the bag. In Google Images we cannot define a library of images, so it tries to match all the features in our image, not only the picture's features: it recognizes a room with a picture in the center, or other things. So we have to find another strategy.

Let's try to understand what could be the best candidate for a feature. Take a rectangle, for example; we can find three possible areas. Zone A is a flat surface: if we move the area in any direction, the content of square A does not change in aspect, so we cannot localize the position of A. Zone B is an edge: if we move the zone in the vertical direction the content of square B changes, but not in the horizontal direction, so we can determine the position of zone B only in the vertical direction. The optimal case is zone C, which is a corner: there we can determine both the vertical and the horizontal position of the zone. So corners are the best candidates to be features, because they have an orientation.

We therefore need a corner detection algorithm, and this algorithm should be scale-invariant. The reason is shown in the figure: a scale-variant algorithm would recognize the line on the left as a corner, but it would recognize the same line, at a bigger scale, as two or three corners. For this reason we need a scale-invariant algorithm.

The solution is the Scale-Invariant Feature Transform (SIFT) algorithm by David Lowe. These are its basic steps. First of all, a Difference of Gaussians operator is applied to the image, varying the standard deviation; the result is a distribution, and the extrema of the distribution are the scale-invariant blobs. Difference of Gaussians detects both corners and edges, so an algorithm similar to the Harris corner detector is used to rule out the edges. To obtain rotation invariance, an orientation is assigned to every key point. Then, for each key point, a descriptor vector is created, and in this way key points are reliably recognized. Key point matching is performed with a nearest neighbour algorithm between descriptors.

OpenCV has various recognition algorithms already implemented, and it has Python bindings. Let's see an example using the SIFT algorithm. First of all we load the image in grayscale; indeed, image recognition algorithms are color independent. We instantiate the SIFT algorithm. Every image recognition algorithm in OpenCV has a function called detectAndCompute that returns key points and descriptors. The algorithm has a few parameters that can be tuned for the specific case.
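A minimal sketch of this step could look like the following (the file name is just a placeholder; in older OpenCV builds SIFT lives in the xfeatures2d contrib module instead of the main namespace):

```python
import cv2

# Load the query photo in grayscale: the recognition algorithms are color independent
img = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)

# Instantiate SIFT (older builds: cv2.xfeatures2d.SIFT_create())
sift = cv2.SIFT_create()

# detectAndCompute returns the key points and their descriptor vectors
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), "key points found")
```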
OpenCV also has a function to draw the key points and their orientation on the image. The colored circles are the recognized key points, and they have an orientation. We perform the same process on the library image, and we can imagine having several images in our library. We match descriptors with a nearest neighbour algorithm called FLANN, and we keep the good matches following the ratio test described in the paper. From the number of matched key points we can understand whether the image has been recognized: we set a threshold on the number of matches above which we say that we have recognized the picture. This threshold is closely related to the algorithm and to the parameters used for the recognition.

In this slide I show an example of an image recognized and one not recognized; we can see that the number of matches is significantly different. For this example I used the SIFT algorithm. For our application we use the SURF algorithm, which is an approximation of SIFT. We use it because it is computationally more performant, so it is better for real-time recognition. But the number of key points found by this algorithm is significantly lower, so the difference between the recognized and not-recognized cases can be very small. We have to find a strategy to avoid false positives: we developed an algorithm to compute the position of the observer, which gives us a method to exclude them.

Let's try to compute the position of the observer with respect to the image. We have to find the transformation that links the library image and the picture in the photo. This transformation is called a homography, and it links different projective planes. We have two cameras, A and B, looking at the same point P on a plane. The projections of P in A and B are respectively PA and PB, and we can express PA as a function of PB, the K matrices and M. M is the homography, and it can be expressed through R, the rotation matrix, and T, the translation. The K matrices are the camera intrinsic parameters, so we can compute them. This process is called camera calibration, and it is performed with the chessboard method: we take pictures of a chessboard from different angles, we find the corners of the chessboard with an algorithm, and we estimate the distortion to correct so as to obtain straight lines. We can see that in this photo the lines of the chessboard are slightly distorted, and we have to correct this behavior. This process is already implemented in OpenCV through the functions findChessboardCorners and calibrateCamera. Starting from these parameters we can apply a transformation to the image, and now the lines of the chessboard are much straighter.

Now we can compute the position of the picture in the image. First of all, we extract the matching key points for the photo and for the library image. The OpenCV function findHomography extracts the homography transformation from two sets of points. Then we create an array with the four corners of the picture and transform these points with the homography matrix M; this is done by the OpenCV function perspectiveTransform. We note that the computed homography is good, because the red rectangle lies over the picture in the image.
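As a rough sketch of the matching step just described (the file names, the 0.7 ratio and the threshold of 10 matches are illustrative assumptions, to be tuned for the specific algorithm and parameters):

```python
import cv2

sift = cv2.SIFT_create()
kp_photo, des_photo = sift.detectAndCompute(
    cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE), None)
kp_lib, des_lib = sift.detectAndCompute(
    cv2.imread("library_image.jpg", cv2.IMREAD_GRAYSCALE), None)

# FLANN-based nearest neighbour matching between the two descriptor sets
FLANN_INDEX_KDTREE = 1
flann = cv2.FlannBasedMatcher(dict(algorithm=FLANN_INDEX_KDTREE, trees=5),
                              dict(checks=50))
matches = flann.knnMatch(des_photo, des_lib, k=2)

# Ratio test: keep a match only if it is clearly better than the runner-up
good = [m for m, n in matches if m.distance < 0.7 * n.distance]

# Threshold on the number of good matches, above which the picture
# is considered recognized
MIN_MATCHES = 10
recognized = len(good) > MIN_MATCHES
```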
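The chessboard calibration could look roughly like this; the pattern size and the file pattern are assumptions, and only the use of findChessboardCorners and calibrateCamera comes from the talk:

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)  # inner corners of the chessboard (an assumption)

# Chessboard corners in the board's own reference system (Z = 0 plane)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for fname in glob.glob("chessboard_*.jpg"):
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# mtx: camera matrix (intrinsics), dist: distortion coefficients
ret, mtx, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)

# Undistort a photo so that the chessboard lines become straight
undistorted = cv2.undistort(cv2.imread("photo.jpg"), mtx, dist)
```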
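Finally, a sketch of the positioning step that projects the corners of the library picture onto the photo (variable names follow the matching sketch; the RANSAC reprojection threshold of 5.0 is an assumption):

```python
import cv2
import numpy as np

# Coordinates of the matched key points: library image -> photo
# (kp_lib, kp_photo and the list of good matches come from the matching sketch)
src_pts = np.float32([kp_lib[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
dst_pts = np.float32([kp_photo[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)

# Homography between the two projective planes; RANSAC rejects outlier matches
M, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)

# The four corners of the library picture ...
h, w = cv2.imread("library_image.jpg", cv2.IMREAD_GRAYSCALE).shape
corners = np.float32([[0, 0], [0, h - 1],
                      [w - 1, h - 1], [w - 1, 0]]).reshape(-1, 1, 2)

# ... projected into the photo with the homography M: if the resulting red
# rectangle does not sit over the picture, the match is a false positive
projected = cv2.perspectiveTransform(corners, M)
```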
Let's see how we can use this method to exclude false positives. In this case we have a lot of matches, which would lead us to think that we have recognized the picture, but the picture is not the same. If we compute the picture positioning, however, we can note that it is wrong: the red rectangle does not fit the real position of the picture. So it is a false positive, and we found it thanks to the wrong positioning.

We also tested image recognition in the 3D case. We can start with a picture of the Arch of Constantine taken from the right. Image recognition in this case doesn't work, because there are too many differences between the images. It works if the library also includes an image of the arch taken from the right, so we need at least three images in the library: from the left, from the right and from the center. Obviously, we can apply this algorithm only to the fronts of very characteristic objects, such as arches or churches, so in the tourism case.

We use image recognition in a Google Glass application. It is a tourist guide application that plays media content based on localization. For now it has been tested in the archaeological area of the Palatino. In this application, image recognition is used for refined localization, based on what you are looking at and on the planimetry of the place, and it is used to play extra information about the artwork you are looking at. Thank you for your attention. If you have any questions...

Thank you, I have a question. Do you do any other transformations to enhance the image quality before you do any processing? For example, are you trying to detect whether the image is blurry, or whether there is glare from street lights or something?

Can you repeat?

Let's say you take a picture with your mobile phone and the camera isn't focused properly. Do you take the picture as it is, or do you have other algorithms to improve the picture so you can run your analysis?

No. We tested different cameras, and every camera has a calibration matrix that corrects the distortion of the camera, but the images must be focused; blurring is not handled.

In your example you had that red square, and you could easily see that in some cases it was wrong, but how does the computer see that it is wrong?

Okay, one moment. With this computation we can also compute the position of the observer: we have to extract the rotation and translation from the homography. OpenCV has two useful functions for this. One is solvePnP, which, from mtx and dist (the camera intrinsic parameters) and the points found in the image, extracts the rotation vectors and the translation vectors. OpenCV also implements Rodrigues, an algorithm that converts a rotation vector into a rotation matrix, so we can compute the translation in the reference system of the picture. Then, if the translation is, how can I say, wrong, so if the observer for example ends up behind the picture, or too far above or below it, we can exclude the match.

Okay, thank you. More questions? No questions? Okay, hi. Is this presentation available somewhere?

Yes, there is a link; you can find it on my LinkedIn page.

One more question. You said that with the library you only recognize a certain number of pictures. How large can the library be for real-time processing: just 10 pictures, hundreds, thousands?

We tested with 10 pictures in the library, but we can increase this number. The computational time increases, but we can parallelize the process, because the comparisons are independent.

More questions?
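As a concrete sketch of the answer above about the observer position (mtx and dist are assumed to come from the calibrateCamera step, and `projected`, `h`, `w` from the homography sketch; the corner ordering and the plausibility check are illustrative):

```python
import cv2
import numpy as np

# Picture corners in the picture's own reference system (Z = 0 plane),
# in the same order as the corners projected with perspectiveTransform
obj_pts = np.float32([[0, 0, 0], [0, h - 1, 0],
                      [w - 1, h - 1, 0], [w - 1, 0, 0]])
img_pts = projected.reshape(-1, 2)

# solvePnP: rotation and translation vectors of the camera with respect
# to the picture plane, given the camera intrinsics mtx and dist
ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, mtx, dist)

# Rodrigues converts the rotation vector into a rotation matrix
R, _ = cv2.Rodrigues(rvec)

# Observer position expressed in the picture's reference system; an
# implausible position (e.g. behind the picture, or far above/below it)
# is used to discard the match as a false positive
observer = -R.T @ tvec
```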
When you do image recognition, I often see green lines going outside of the red square. So basically you find the position of the image on your base image; why do you use features extracted outside of the red area? Why don't you restrict to the features that are inside the red square?

When the features are outside the picture, why do you compare them?

Okay, for example, in this example... okay, we can go to the arch example, where we can see the red square and some features like these: you can see green lines below the red square. It performs a fit for the positioning, so the key points outside the picture, that is the wrong key points, are fewer than the accepted ones, and the fit excludes them, probably with a chi-square test or something similar.

More questions? I have a question about the demo that you did. Do you do the processing on the Glass, or do you send it somewhere, to an external server?

I don't understand, excuse me.

So in your application for the historical site you said that you display information depending on where you are in the site. Yeah. And for that you need to figure out where you are, so you're comparing to an image in your library, right?

It localizes with GPS, but the error of GPS is of the order of meters, and we refine the localization by extracting what you are looking at in the place. So we can preselect, from the GPS, the nearest objects you might be looking at.

And you do the processing on your image, right? You compare what you see with the library. Is that on the Glass?

No, it's server side.

When you're converting to grayscale, because you're potentially losing detail information, are you just doing a flat grayscale conversion, or are you doing any kind of optimization in the process of moving to grayscale?

We use the SURF algorithm because it is more computationally performant.

I mean, before you run the SURF algorithm you're converting a color image to grayscale; at least that's what I thought you were doing first. Is there any kind of special grayscaling you're doing, or is it just the conventional one? Does OpenCV do anything special?

No, just this.

More questions? No questions? Thank you for your attention.