So today we want to talk about an application of AI, namely computer vision, and we talk about it because there are two methods we have to apply before we use any AI technique. They could theoretically be considered AI techniques, but they come from computer vision, which is a mixture of AI and non-AI techniques. There is a lot to cover today, but we will do our best. We start by saying that AI is vision: understanding images requires intelligence, the intelligence to recognize people, scenes, objects, patterns. The applications are familiar to everybody: face recognition, object recognition, auto-captioning of images (actually a recent development), robot navigation, and many more. So the question is how we do this, since it happens to be one of the most impressive, exciting examples of AI. That is the reason we are talking about it: we want to introduce two methods that you have to apply to visual data as non-AI preprocessing. Then, hopefully three weeks from now, we will see how AI itself does it, and we will have a better feeling for the differences. Given a digital image, we have basically two possible ways to do recognition, say recognizing a face, or saying that's a cat, that's a car, that's a dog playing in the park. First we have the pure AI approach; well, it is not really "first", I just grabbed it first. You take the image, you give it to some sort of AI black box, and out come classes or text. That's basically it: I give an image to some AI technology, and on the other side it tells me either a label, a phrase, or a sentence, if I am lucky. The second approach is the computer vision / AI approach. I am not sure we can still make that distinction, but there are techniques that are part of computer vision and per se do not exhibit any intelligence, so we have to keep them separate. You still take the image and give it to a box called computer vision, and that box gives you features; then the features go to the black box of AI, and again you get classes and text. Why did I say the computer vision part is not a black box? Because we can verify every step. There is nothing hidden in a computer vision algorithm; everything is deterministic. One, two, three, four, you can follow it; usually there is a loop, maybe some recursion, but you understand what is going on. So computer vision is usually a white box, and the AI part is a black box. Who says which one you should use? Most of the time it is not up to us; the nature of the data dictates whether you can go pure AI or you need a combination of computer vision and AI. Again, some colleagues may get upset and say that separating computer vision from AI makes no sense anymore, because the most recent successes come from AI. But perhaps 2% of the techniques that are impressing us right now come from AI; 98% of the techniques in computer vision have no intelligence in them, yet they are good. So we have to be clear about this. What we talk about today is feature extraction. So what is feature extraction?
A feature means: give me some numbers through which the computer understands the data; not the data itself, not the raw data, because raw data is difficult to understand. Extract some descriptive, rich numbers that describe the data; that is what we call features. In computer vision we have basically two approaches. One is key-point oriented, with techniques such as SIFT, SURF, and others. We will not talk about how they work; you can look them up, SIFT features, SURF features, you will find them. They will give you several hundred or even thousands of features. The way they work is: you give them an image, and those algorithms find key points; they say this point, that point, and that point are important, and they calculate that somehow. If I look at a blank wall, it has no key points; it is just a homogeneous surface, there is no discontinuity, nothing is happening, the derivative is zero everywhere, it is constant. If I look at a textured wall, something is happening; if I look at the class, a lot is happening. So: no key points, some key points, a lot of key points. We do not need to go into detail. The second approach in computer vision is histogram oriented. Here we have techniques called LBP, HOG, and many others. What you get here is just one histogram, which is a vector of some numbers, usually a short vector: 128, 256, maybe 500 entries. If we take a look at an example (I cannot draw a real image on the board, but something simple): I get an image of a scene, there is a house, there are mountains, and so on. One of these methods is, for example, Harris corner detection, which is a key-point technique similar to SIFT and SURF: it finds key points, specifically corners. I want you to understand how a non-intelligent technique does it before we go into intelligent techniques. So it finds corners: this is a corner, this is a corner, corner, corner, corner. The obvious ones are clear. But if you apply it to some image, you will see it also finds a corner where you say, wow, there is no corner; yet if you zoom in, there is a digital corner, just maybe not for us. I say that because the concept of a corner may be counter-intuitive for us as humans: we really want to see a corner, but in the digital world the question is, what is your threshold for the sharp edges that build a corner? So you apply corner detection and you will find some corners. Then around each corner you cut out a small sub-image, say 16 by 16 pixels, around that key point. Usually you do not see it unless you zoom in, in Photoshop for instance, and then you say: oh, maybe this slight curvature has caused the algorithm to say that is a corner, even though we do not see it zoomed out. That gap between computers and people is the semantic gap. And when we find this window, we calculate some numbers for it, for example 22, 31, 53, 111, 165.
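As a minimal sketch of what was just described, here is Harris corner detection with OpenCV. The file name "scene.jpg" and the threshold of 1% of the maximum response are placeholders; the threshold is exactly the tuning knob mentioned above that decides what counts as a "digital corner".

```python
# Hedged sketch: Harris corner detection on a placeholder image file.
import cv2
import numpy as np

img = cv2.imread("scene.jpg")                    # placeholder file name
gray = np.float32(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))

# cornerHarris returns a response map; high values indicate corner-like spots.
response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

# The 1%-of-max threshold is an illustrative choice, not a standard value.
corners = np.argwhere(response > 0.01 * response.max())
print(f"found {len(corners)} corner key points")
```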
So after you find a key point, you construct a small window around it, and for that small window you calculate some numbers: these are features. Of course, the length of this feature vector has to be smaller than 16 by 16; otherwise it does not make sense, because you should not need more numbers to describe the data than the data itself. Part of intelligence is for the features to be compact, of short length. For SIFT, for example, you get many features of length 128. By default, if you apply that algorithm called SIFT to an image, say a 2,000 by 3,000 pixel photo captured from this side of the class, I would guess you get four or five thousand key points. That is overwhelming; you cannot verify them manually. Four or five thousand key points, each described with 128 numbers: that is massive. But the image is 2,000 by 3,000, six million pixels, and we are reducing it to 5,000 times 128, which is a big reduction. Why are we doing all this? If AI is vision, and we as humans acquire and process 90% of the information we work with visually, then you cannot have AI without being able to process digital images; it cannot be. That is why we are talking about it. So your image is the union of those feature vectors: $X = \bigcup_{i=1}^{N} v_i$, where $N$ is the number of key points and $v_i$ is the feature vector of key point $i$ (that symbol is a union sign, not a U). For the computer, if you give me 5,000 key points and a feature vector for each, the union of these feature vectors is what the computer understands of this image. Everything else is ignored; that is what the computer sees. The challenge is this: first, the data is too large; second, the data may not be descriptive enough. The solutions: we could use embedding, we could use pooling (which is a type of embedding), or we could use encoding. One concern is that 5,000 key points, each a vector of 128 numbers, is still a lot of information. Well, we talked about PCA, and that is good: I could apply PCA to those 5,000 feature vectors, decide some of them are garbage, and get rid of them; fantastic, we do that. But here I am before PCA, and PCA is an AI technique. I want to make life as easy as I can for the AI; that is a philosophical point. We do not want to give garbage data to the AI. We clean it up, we do the best we can, and then we hand it to the AI and say: you do the rest, you do the difficult part. So the two solutions are the Fisher vector and the so-called VLAD. I want to see whether we can cover both today. VLAD is relatively simple and straightforward; the Fisher vector requires a lot of statistics and equations. We will try to scratch the surface of the Fisher vector, just to get some idea what statistics are used, and then we will see. The Fisher vector is fundamentally embedding and pooling; VLAD is encoding and increasing discrimination. Different things, but one could regard VLAD as a simplification of the Fisher vector. So we start with the complex one.
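To make the "image = union of feature vectors" idea concrete, here is a short sketch using SIFT, assuming OpenCV 4.4 or newer (where SIFT_create is in the main package) and the same placeholder file name as before.

```python
# Hedged sketch: an image reduced to a set of SIFT descriptors.
import cv2

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder file
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

# descriptors is an (N, 128) array: N key points, each a 128-dim vector v_i.
# The union of the rows is what the computer "sees" of this image.
n, d = descriptors.shape
pixels = gray.shape[0] * gray.shape[1]
print(f"{pixels} pixels reduced to {n} x {d} = {n * d} numbers")
```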
Hopefully we get some sense of what it does, and then we move to the simple one. So let's say we have our dataset $X = \{x_t,\ t = 1, \dots, T\}$, and let $u_\lambda$ be a probability density function which models the generative process of the elements of $X$. Many newcomers get confused by the terminology, because suddenly we leave the words we know, features and scenes and objects and faces and cats and dogs, and start talking about probability distributions and generative processes. What is that? Why? $X$ is your features, and there is a generative process that generates those features. SIFT, for example, is a process that generates those numbers, and like anything else, it follows a distribution; everything and anything follows a distribution. Hopefully a Gaussian one; we can assume it is Gaussian, and if not, we resort to other things. So I have a distribution for my features, and that distribution has some parameters $\lambda$ that I am not talking about yet. The distribution models the generative process: if I can get a sense of the distribution, I can understand how features are generated, and if I can understand how features are generated, I can pool features: the important ones, not the crappy ones. No matter what technique you use, it will give you some garbage and some good stuff. We still do not know neural networks, so we do not know what that black box does with its millions of weights; we want to do it in a different way. Then we have $\lambda = (\lambda_1, \lambda_2, \dots, \lambda_M)^{\mathsf T}$, a vector in $M$-dimensional space: the parameters of $u_\lambda$. So the distribution has some parameters: the standard deviation, the position of the mean, the location, the spread; whatever defines that distribution. Okay, good, we follow so far. Now we need some statistics. Why are we saying this? Look: suppose I draw two candidate distributions, one narrow and one wide, and I have some parameter value $\lambda_0$ I want to recover, plus a function $L$ that we do not know yet. For the narrow distribution the standard deviation is small, so I can map the data to my $\lambda_0$ with a lot less error than with the wide one: it fits the data more accurately. That is why the distribution matters. No matter who does the estimating, a deterministic algorithm, a stupid algorithm, or a deep neural network, what is being estimated is always a distribution: the distribution of the data, the distribution of the features; there is nothing else. So the question we are formalizing is: there are some parameters that give me a distribution that fits the data nicely, and there are some parameters that are crappy; I have to find the ones that give a good fit to the data, to the generative process. How are those features generated? We do not know, because that process is highly nonlinear; but if I can get a sense of it, I can do a better job of selecting good features. So now we have to go into the difficult part and talk about statistics, and in statistics there is the concept of the score function.
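As a hedged illustration of "a density $u_\lambda$ that models the generative process of the descriptors", here is a Gaussian mixture fitted to stand-in data. The lecture has only assumed "hopefully Gaussian" so far; the mixture, the component count, and the random data are all my assumptions for the sketch, with scikit-learn's GaussianMixture standing in for whatever density estimator is used.

```python
# Hedged sketch: fit a candidate u_lambda to (synthetic) local descriptors.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 128))      # stand-in for 5,000 SIFT descriptors

# lambda = (weights, means, covariances): the parameters of u_lambda.
u_lambda = GaussianMixture(n_components=16, covariance_type="diag",
                           random_state=0)
u_lambda.fit(X)

# A well-fitted model assigns high likelihood to the descriptors it explains.
print("average log-likelihood:", u_lambda.score(X))
```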
The score function is the gradient, that is, the partial derivative with respect to (w.r.t.) the parameter $\lambda$, of the natural logarithm of the likelihood function. So the score function is given as $G^X_\lambda = \nabla_\lambda \log u_\lambda(X)$. When you estimate a distribution, any distribution has parameters. Let me do a side note here, because otherwise you will not understand why it is a log-likelihood. Let's say $A$ and $B$ (completely different notation, just for this side note) are independent and identically distributed random variables; they are i.i.d. Then the likelihood of $A$ and $B$ given a distribution $u$ is $f(A, B \mid u) = f(A \mid u) \cdot f(B \mid u)$. You should know this if you have heard of Bayes' theorem, which we will talk about after the reading week. If I read $f$ as a probability density function, it answers: what is the likelihood that $A$ and $B$ occur together, given a distribution? If they are independent, it is the probability of $A$ given $u$ times the probability of $B$ given $u$; if they are not independent, it is not that easy. But we cannot really work with this, because that multiplication is really difficult to handle. So what people do is take the log: $\log f(A, B \mid u) = \log f(A \mid u) + \log f(B \mid u)$. So what we did, fundamentally, is convert the multiplication into an addition by using the log. Now we have a log-likelihood. It is mathematical convenience; there is no magic in why we use the log. If we could work with the product it would give us the same result, but we cannot, because things get crazily complex, and working with products of probabilities is really tough; addition is nicer, easier. Okay, that was the side note; I just did not want us to get into overly complicated things. Now we want to go back and see how much work it takes just to recognize that a cat is a cat. When we go through this torture, we will appreciate the deep networks: just give the image to the network and it says, that's a cat. But you need to understand how it is done, otherwise you are not a good engineer. So again, the score function is $G^X_\lambda = \nabla_\lambda \log u_\lambda(X) = \nabla_\lambda \log p(X \mid u_\lambda)$; I wrote $f$ in the side note to keep it general, but it is a density function, a probability. Now let $X$ be the set of $D$-dimensional local descriptors extracted from an image; I want to connect this to what we were talking about. These are local descriptors: the image is global, and each of these feature vectors is a local descriptor. We did not talk about how these numbers are calculated; we are not interested, it is not our business at the moment. Somebody gives me 5,000 vectors of 128 numbers each and says this is a cat; can you recognize it? I want to process that. So let $X$ be the set of $D$-dimensional local descriptors extracted from an image.
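Here is a tiny worked example of the score function for a 1-D Gaussian, where the gradient of the log-likelihood with respect to the mean $\mu$ has the closed form $(x - \mu)/\sigma^2$. The numbers are made up purely for illustration; the numeric finite difference is there to confirm the analytic gradient.

```python
# Worked example: score function of a 1-D Gaussian w.r.t. its mean.
import numpy as np

mu, sigma = 2.0, 0.5
x = np.array([1.8, 2.3, 2.1])          # made-up i.i.d. samples

def log_lik(m):
    # log-likelihood of i.i.d. samples: a sum of logs instead of a product
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - m) ** 2 / (2 * sigma**2))

analytic = np.sum((x - mu) / sigma**2)                       # the score at mu
numeric = (log_lik(mu + 1e-6) - log_lik(mu - 1e-6)) / 2e-6   # finite diff
print(analytic, numeric)                                     # should agree
```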
The descriptors could come, for example, from SIFT, from SURF, from many computer vision techniques that have no intelligence in the common sense of the AI community: they do not learn anything; they execute deterministic steps, with no iteration to adjust themselves. Then we can write the score function operating on $X$ as $G^X_\lambda = \sum_{t=1}^{T} L_\lambda \, \nabla_\lambda \log u_\lambda(x_t)$, where $L_\lambda$ is a new function of $\lambda$ that we have not defined yet. Why can we write it as a sum over the individual descriptors? Because of the independent and identically distributed assumption from the side note: the log of the product becomes a sum. So now we are introducing something new: a function with parameter $\lambda$ operating on $X$. We have not named it anything yet, because this is our Fisher vector; whatever it computes, we call it the Fisher vector. Okay, what is it? The Fisher vector is a sum of normalized gradient statistics computed for each descriptor; and when we say descriptor, we mean feature vector, the same thing. So the idea of the Fisher vector is: give me normalized gradient statistics. If we bring in statistics, we should do better. What is the simplest statistic you can use? Instead of working with the data, work with the average of the data; subtract the average; do something with a statistic. Why? Why statistics? Why not calculus? Because AI is about data. When you get data, you have samples and a population, and the only field of mathematics that can help you there is statistics; there is no other way. We cannot use algebra at the point where you want to find out what is important; after we put things in matrix format, then we can use other machinery. That is the reason we talk about statistics. And this operation, mapping each data point $x_t$ through a function that uses the Fisher kernel to $L_\lambda \nabla_\lambda \log u_\lambda(x_t)$, is an embedding. If I add these up over all descriptors, that is your Fisher vector: imagine you have many of these underneath each other; first you embed them, then you add them. It is an embedding of the local descriptors $x_t$ in a higher-dimensional space, which is easier for a linear classifier. Okay, we are not done yet, but you are getting some crucial information: first, we are talking about normalized gradients that are based on some statistics; second, we map them with some function that we still do not know how to obtain, to put the Fisher vector together. I know it can be frustrating; stick with me. Around reading week we will have a better picture of why we are doing this. "Embedding" is a keyword.
Embedding means: I take the data and put it in another space; a better space, a more compact space, a more descriptive space. Otherwise why would I do this at all; I would just use the data. We embed the local descriptors in a higher-dimensional space, which is easier for a linear classifier. Oh, I see: we want to recognize that this is a house, this is a mountain, this is a river; we want to recognize the scene, and that is a highly nonlinear problem. After this step there has to be a classifier: I give it a bunch of vectors and it says river; another bunch, it says house; another bunch, it says mountain. For that classification you have multiple classes for which you must have trained. You cannot recognize the entire planet: you have trained for river, landscape, hills, mountains, houses, buildings, and if anything else comes, you say "I don't know"; you may have one class for that. Other than that, everything you do is highly nonlinear: the approximation we want to find, the function that maps $X$ to $Y$, where $X$ is our features and $Y$ is the class (river, house, mountain), is very nonlinear. So with all this machinery, and I am still skipping a lot of it, we want to make it easier for the classifier: if the problem is 85% nonlinear, I want to make it 25% nonlinear; to make the problem easier. So what is this mysterious $L_\lambda$? It comes, for example, from the Cholesky decomposition; this is where it gets really interesting and complicated in terms of how we use the statistics. The inverse of the Fisher information matrix $F_\lambda$ can be decomposed as $F_\lambda^{-1} = L_\lambda^{\mathsf T} L_\lambda$. And for the kernel, we come back to kernels: a kernel is a function with certain properties that takes two inputs and tells you how similar they are in a higher-dimensional space; we call this the kernel trick, and when we get to support vector machines (SVMs) we will talk about it properly. At the moment we just accept that there is a kernel that uses Fisher statistics, which we call the Fisher kernel: $K(X, Y) = G^{X\,\mathsf T}_\lambda \, F_\lambda^{-1} \, G^{Y}_\lambda$, built from the Fisher vectors we calculated above and the inverse Fisher information matrix. So I am putting it in matrix format: I take the Fisher vectors, put them in a matrix, decompose the Fisher information matrix, and after the decomposition I can write the kernel this way. That is actually the embedding part, the magic of kernels: it gives me better numbers. So this is the Fisher kernel. I do not want to go further than this, because going into how we decompose and compute everything would turn this into a statistics course; we are just taking a little bit of it. There are still a lot of steps, but you do not have to implement them yourself: there are websites, especially for the two techniques we are discussing, where you can download the algorithms and use them. What you want is some minimal understanding. So let me give you a simpler one, because people said: that is fantastic, that is 120 years of statistics behind it, but I never took a statistics course, so what should I do? Okay, we have other options.
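To make the preceding abstract pieces slightly more tangible, here is a simplified Fisher-vector sketch: gradients of the log-likelihood with respect to the GMM means only, using the common closed-form diagonal normalization in place of an explicit Cholesky factor $L_\lambda$. This follows the usual Perronnin-style formulation as I understand it; it is an illustration under those assumptions, not a reference implementation, and the data is synthetic.

```python
# Hedged sketch: mean-gradient part of a Fisher vector on synthetic data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64))                 # stand-in local descriptors

gmm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(X)
gamma = gmm.predict_proba(X)                    # soft assignments gamma_t(k)

T = X.shape[0]
fv = []
for k in range(gmm.n_components):
    # per-descriptor gradient statistic for component k, whitened by sigma_k
    diff = (X - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
    # sum of normalized gradient statistics over all descriptors
    g_mu = (gamma[:, k:k + 1] * diff).sum(axis=0) / (T * np.sqrt(gmm.weights_[k]))
    fv.append(g_mu)
fv = np.concatenate(fv)                         # the (mean-part) Fisher vector
print(fv.shape)                                 # 8 components x 64 dims = 512
```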
But if you stick with AI and computer vision, learn this; learn the Fisher information matrix. The next technique, which is actually very simple, is VLAD, which stands for Vector of Locally Aggregated Descriptors. Again, we are still looking at that image; I get many, many descriptors and I want to aggregate them. I did that in a really sophisticated way with the Fisher vector, with a lot of statistics behind it, with a decomposition and an embedding and all that: we went back to the underlying distribution, used the fact that for independent data you can reconstruct the joint likelihood by multiplying the individual probabilities, found that too difficult, took the log so the product becomes a sum, brought in the Fisher information matrix, and so on. If you did not understand all of it, it is not your fault, because we were just scratching the surface: I have not taught you the Fisher vector, I have merely mentioned it to you; a lot of additional background is necessary for that. Hence, let's talk about something simpler. Why are we doing this at all? Because, again, we want to build the foundation for why we do things with neural networks and not with these techniques, which are very, very useful and very good; but we don't, maybe other people do. So again the question comes up: how do we recognize images? The Fisher vector was fundamentally suited to key points, at least that is how I would use it. If you have an image of 1,000 by 1,000 pixels, you have a million pixels; you cannot process all of that. But if a method like SIFT gives me 5,000 key points, that is 5,000 out of a million; not bad. Each of them is a vector; I embed it; hopefully I can then do an easier classification. How else can we recognize? One very established method is the so-called bag of visual words. It was inspired by bag of words, which is a method for text document classification, website classification, spam recognition, and so on: we work with a bag of features or a bag of words, and the words that are important describe the content of a website, for example. But what is a word in an image? What is a visual word? If I look at a scene, any small part of it could be a visual word. If you give me an image, if I draw my kindergarten picture again, and I grab a small patch around a point and cut it out, that is a visual word. Usually we use a fixed size, most commonly 16 by 16. Now, there is not an infinite number of visual words needed to construct an image; that is the interesting thing. If you have some small set of visual words, say edges in this direction and that direction, maybe a corner this way and a corner that way, maybe some curves, then you could paint any image with those words, like a dictionary. How many of these words do I need to paint an image? A very interesting question. Is there such a thing as a visual dictionary we could come up with? Using those tiny 16 by 16 pixel patches, if I knew which word to use where, I could construct the image of a cat. Is it possible? People do it, because the bag of visual words is a very, very competitive technique: to this day, for certain applications, the bag of visual words can beat deep networks for image recognition.
So, given an image $I$, divide it into small cells or windows of size $N$ by $N$; in general 16 by 16, and nobody knows why: it just works nicely with 16 by 16. If you make it 14 by 14 it collapses; if you make it 18 by 18, for most images it does not converge. It is one of those things we know empirically but cannot theorize; a big part of AI has become empirical. So you take the image, whatever it contains, and split it into many, many cells or windows of this size. Then, for example, this cell is an edge like this, this one is a curve like this, this one is a corner like this: you collect visual words. A visual word has to be small enough to count as a word; it cannot be a big part of the image, because then it gets too complicated. It has to be something you can describe with one or two words: that is a horizontal edge; that is a slight horizontal curve; that is a perpendicular corner. If you give me a big portion of the image, I cannot describe it easily; I would have to write half a page. So every cell is a visual word. (To a question from the class: yes, to a certain extent this comes down to thresholding what is what, and it has its limits with respect to resolution; but if the data is well prepared, you can use it for basically every application. And as an alternative you do not have to do this densely; you can still use key points, but visual dictionaries usually work this way.) We usually vectorize visual words for convenient calculation; see the sketch after this paragraph. You see, this is the kind of thing we have to do when we build a computer vision algorithm: there is no learning involved; we have to think about every step ourselves. Now, if you collect all of them, you get many, many of these: many horizontal edges with slight differences, many curves that are almost the same. Can I keep all of them? That is not smart; keeping every visual word is not intelligence. I need a representative for them, so I need to encode them. How do I encode them? I could take an average, and as a matter of fact, we sort of do: we apply the simplest clustering technique we have, k-means, which hopefully we talk about next week. K-means says: give me an average, calculate the similarity of every item to the average, and put similar things in the same bucket. But collecting raw cells is still not encoding, because it is still the raw data, and we cannot work with raw data; raw data is brutal. We need to encode it, which means: grasp the essence of the data and give it to me in a form that is easier to process, like the intelligent Mr. Fisher did; but that was too complicated, so can we do it more simply? (Question: what is the difference between embedding and encoding? Embedding usually has pooling in it, so you may reduce the dimensionality in some way; encoding may not touch the dimensionality at all, it just encodes, making the classification easier. So after encoding you may still need PCA or LDA, depending on how high-dimensional the problem is.) Okay. So what is the theory?
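Here is the vectorization step as a minimal sketch: cut the image into 16 by 16 cells and flatten each cell into a 256-dimensional vector. These are the raw visual words before any encoding; the file name is again a placeholder.

```python
# Hedged sketch: dense sampling of 16x16 cells, each vectorized to 256 numbers.
import cv2
import numpy as np

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder file
n = 16
h, w = gray.shape
patches = [
    gray[r:r + n, c:c + n].reshape(-1).astype(np.float32)
    for r in range(0, h - n + 1, n)
    for c in range(0, w - n + 1, n)
]
words = np.stack(patches)        # shape: (number of cells, 256)
print(words.shape)
```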
What is the theory? The theory says: you get an image, you apply the bag of visual words, you get many, many visual words, millions of them, you give them to a classifier, and the classifier says: that's a cat, that's a house, that's a bicycle. Find a bag of visual words, a bunch of visual words, and give it to a good classifier. Up to that point it is computer vision; computer vision cannot solve the classification, so it hands over to AI: okay, I got the visual words, that was hard enough, now you do the rest. Any classifier is within the domain of AI, because it has iterations in it; it minimizes or maximizes something. And what is the practice? The theory says do this; in practice, first of all, you have way too many vectors, especially if you go with what we call dense sampling, where you sample everything. The other option is sparse sampling, where you only sample the key points, the significant locations in the image. Sparse sampling versus dense sampling. You may have redundancy; good thing we know PCA, so I can get rid of that, that is not an issue. You may have noise, so you may have to filter; and in images we have all types of noise: white Gaussian noise, impulsive noise, speckle noise. So the theory says: take an image, give me a bag of visual words, which are basically tiny parts of the image that are similar, and hand it to a good classifier that tells me what it is. We want to follow how intelligence is manifested in this type of technique. You can do this in many different ways; there are 5,000 papers on how to do it differently. It is not about the details, it is about the big picture. So what is the general approach? Build a codebook $C = \{c_1, c_2, \dots, c_N\}$ from $M \gg N$ feature vectors, which for us are vectorized visual words. Take the visual words we just saw examples of, vectorize each one, and consider it your feature vector; you may have five million of them. There are many, many of them, but we want just a small number: if you have a million feature vectors, I want to construct a codebook, which is my dictionary, and usually we set the dictionary size. Say I want a dictionary with a thousand words; you cannot have 600,000 words like Oxford or Webster, we cannot afford that. Usually we go with 600, 1,000, 1,200, 1,600, maybe 2,000; if you cross more than 2,000 visual words, oh my God, you need massive computational power, because you have to extract and match all of them, and you know how much that takes. So $N$ is very small, and $M$, the number of feature vectors, is much larger: you give me five million visual words you extracted and I want to make that 1,500, wow. And of course I do not want to lose anything, because what would happen if I lose the ears of the cat, not physically, but in the image? I cannot recognize the cat anymore. So what is the idea? Use a clustering algorithm like k-means. Wow, this is something we do not know yet.
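As a sketch of the codebook construction just described, here is k-means over the `words` matrix from the previous snippet, with an assumed dictionary size of 1,000. MiniBatchKMeans is my choice here because $M$ can be in the millions; plain KMeans works the same way on small data.

```python
# Hedged sketch: build a codebook C = {c_1, ..., c_N} with k-means.
from sklearn.cluster import MiniBatchKMeans

N = 1000                                    # assumed dictionary size
kmeans = MiniBatchKMeans(n_clusters=N, random_state=0, n_init=3).fit(words)
codebook = kmeans.cluster_centers_          # shape (N, 256): the prototypes
print(codebook.shape)
```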
So if you use a clustering algorithm like k-means, which at the moment, officially, we do not even know (we do not know what clustering means yet, and I have no idea what k-means is; apparently it is one of those simple AI techniques where you give it a bunch of numbers and it groups them into categories), then the $c_i$ in my codebook are the centers of the $N$ classes found in the data. Why is that important? Because, as I drew, you get many, many edges or corners or curves that are similar; all of them are the same word. So that super algorithm has to put all the similar ones into the same category and say: this is a horizontal line, this is a vertical line, this is a corner like this, this is a corner like that; all of them should be in the same class. We do not know yet how it works; it does not matter, we will come back to it. So once you have clustered everything, you do not want to work with the entire data; maybe we go with the cluster center, the prototype of each category: the most typical horizontal edge, the most typical vertical edge, the most typical corner. So what is the core idea of VLAD? Let me draw at least something. This cluster could be my horizontal lines: each element in it is one of those 16 by 16 windows in the image. This cluster could be my vertical lines; this one, corners that look like this; this one, patches that are a 45-degree curve; this one, a corner looking the other way. And each category has a center, the blue cross here: the class center, the most typical representative of that category, the most prototypical horizontal edge, for example. We need a technique that does that for us: I read the image and 5,000 windows look like this, 2,000 windows look like this, 3,000 windows look like that ("images" would be the wrong word here; they are neighborhoods, windows; 16 by 16 hardly qualifies as an image). Okay, we do not know how to do this yet; I know some of you do, but as a class we do not, and we come back to it hopefully next week. What we want to understand now is: VLAD promises to do roughly the same thing as the Fisher vector, but more easily, without all that statistics. Okay, how is it going to work? Maybe I use VLAD. So: accumulate, for each visual word $c_i$, the difference of the vectors $x$ assigned to $c_i$, which is $x - c_i$; in general, the distribution of the data with respect to (w.r.t.) the class centers. So we accumulate the differences to the class center, and the so-called VLAD vector is $v_{i,j} = \sum_{x \,:\, \mathrm{NN}(x) = c_i} (x_j - c_{i,j}),$
where the sum runs over all $x$ whose nearest neighbor among the code words is $c_i$: $\mathrm{NN}(x) = c_i$. So what does it say? Look at just the $x$'s that belong to the same class, build the difference of every $x$ with the class center, and add them up. What else? That's it. You've got to be kidding me: I can use this instead of the Fisher vector? Yes, it is a simplification of it. Why is it a simplification of the Fisher vector? We only scratched the surface of the Fisher vector, but we saw enough to guess why. Or first, tell me: why should this be better? We are just subtracting the class average. (A question first: what is NN? NN is the nearest neighbor: find, for every given $x$, the code word $c_i$ it is closest to. For every given $x$ I find: oh, this one is very close to that center, this one is very close to that center, and so on. The concept of the nearest neighbor stays with us; it is a very important concept: things that are close to each other happen to be similar.) So why do we subtract the mean? This is the second time we are subtracting a mean, but this time it has a different purpose. What is the effect? Anybody? The intra-class variability: if you just use the raw data, you do not see it. Where can I draw it? Okay, let me draw something here; it is messy, sorry about that. Say this is your class center, and you have data around it. Now compare this point with that point: both are in the same class, but one of them is very close to the average. If you use the raw data, you do not have that information; if I subtract the mean, the proximity to the prototype is embedded and encoded in the features, so I know which patch is closer to being a horizontal edge. So we arrive at that: that is the purpose of subtracting the mean, of using the mean. And what is the similarity of this with the Fisher vector? Somebody may say it is not similar at all: you just showed me one equation, $x - c$, come on, while the Fisher vector was twenty equations out of fifty. But this is a simplification of it. There we started with gradients of statistics, partial derivatives. What do you think a difference does? It is the same thing: a difference is a very simple approximation of a gradient. The reason both of them are good is that both use derivatives; they look at changes of statistics, averages, and other quantities. This is a first-order statistic. Okay, if that works, why not use second-order statistics? Well, I take it easy; it becomes more difficult. Let's stick with this and use it, and if things work, we can think about other refinements. So let me write it again: NN is the nearest neighbor, $x_j$ is the $j$-th component of the descriptor $x$, and $c_{i,j}$ is, of course, the corresponding component of the cluster center. And VLAD does not really perform well unless we also do a final step, which is one of those secrets nobody tells you: your data has to be normalized. If things are not normalized, it takes a lot more time, even for the deepest network, to figure things out; not just for VLAD with my humble sum of $x - c_i$, but even for really sophisticated methods. So we do an L2 normalization.
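Here is the VLAD accumulation exactly as defined above, as a minimal sketch: assign each descriptor to its nearest code word and sum the residuals $x - c_i$. It assumes the `words` and `codebook` arrays from the earlier snippets; the brute-force distance matrix is fine for small data but would need a k-d tree or batching for millions of descriptors.

```python
# Hedged sketch: VLAD vector v_{i,j} = sum over {x : NN(x) = c_i} of (x_j - c_{i,j}).
import numpy as np

def vlad(words, codebook):
    N, D = codebook.shape
    # nearest-neighbor assignment NN(x) = c_i (brute force; small data only)
    d2 = ((words[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nn = d2.argmin(axis=1)
    v = np.zeros((N, D))
    for i in range(N):
        assigned = words[nn == i]
        if len(assigned):
            v[i] = (assigned - codebook[i]).sum(axis=0)   # accumulate residuals
    return v.reshape(-1)                                  # one long vector

v = vlad(words, codebook)
print(v.shape)                                            # N * D numbers
```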
So these are the vectors I now use with my dictionary, but before I use them, I normalize them. How do I write that, in terms of vectors? How do I normalize? Somebody makes my day by telling me how to do it: the vector divided by the magnitude of the vector. Thank you, you made my day. So each vector is divided by its length, its magnitude: $v \leftarrow v / \lVert v \rVert_2$, where $\lVert v \rVert_2$ is the Euclidean length; look it up at home. If I have the vector $(2.1, 1.2)$, how do I get its length? $\sqrt{2.1^2 + 1.2^2}$; I do not want to go back to high school. So I normalize every vector by dividing it by its length. Now: I approximated partial derivatives with a simple difference, I used a first-order statistic, the mean of every class, and then I normalized. Now your data is something that can be given to a simple classifier to say: what is a house, what is this, what is that. So what is the final chain? This is pre-2012; we have still not arrived at 2012. You get the image and give it to something like SIFT; look it up, scale-invariant feature transform. SIFT gives us many, many feature vectors, hundreds or thousands of them. SIFT works with key points, so that is sparse sampling; if I go with taking everything, that is dense sampling. Then I give this to VLAD, and VLAD gives me a different representation (which I am drawing as that vertical bar, because I do not know how else to visualize it); the dimensionality is still fundamentally the same, because VLAD is not about dimensionality reduction or pooling, it is about making things easier to discriminate and normalizing them for easier processing. Then you say: good that I know PCA; I reduce $D$ to $D'$ with $D' \ll D$. Then I give it to a classifier, and the classifier tells me: that's a dog. That is the pre-2012 processing chain for image recognition. You have to sweat to get some of it done; it is not like, have you downloaded TensorFlow, just run it; it does not work that way. You have to put something together. Does that mean we do not need this anymore? No, not at all. In some cases we still have to do this. In what cases? When you do not have enough data, when you do not have labeled data, then you still have to do this. For a dictionary approach I do not need a million images; give me 500 images and I am good to go, and we then sample them. (And to the question: the nearest-neighbor classifier is a classifier, and k-means is a clustering method; they are distinct concepts, and we will talk about them, including approximate nearest neighbors and many other things.) So, hopefully on Thursday we can start talking about actual AI techniques. See you then.
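To close the loop, here is the whole pre-2012 chain as one hedged sketch: L2 normalization, then PCA, then a linear classifier. The VLAD matrix, the label set, and all sizes are stand-ins; LinearSVC is my choice of a simple linear classifier, not something the lecture prescribes.

```python
# Hedged sketch of the pre-2012 chain: SIFT -> VLAD -> L2 norm -> PCA -> classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def l2_normalize(v):
    return v / np.linalg.norm(v)            # vector divided by its length

rng = np.random.default_rng(0)
V = rng.normal(size=(500, 4096))            # stand-in VLAD vectors, one per image
y = rng.integers(0, 3, size=500)            # stand-in labels (house/river/...)

V = np.apply_along_axis(l2_normalize, 1, V) # the final normalization step
V_reduced = PCA(n_components=128).fit_transform(V)   # D -> D' << D
clf = LinearSVC().fit(V_reduced, y)         # the classifier at the end of the chain
print(clf.score(V_reduced, y))
```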