Ever since the first ImageNet competition, ILSVRC 2012, deep learning for computer vision has come a long way. We have new architectures with better performance, all to solve the problem of image classification. Despite the changes in architecture, the core process remains the same: you take a bunch of labeled images of different categories, show them to your classifier during training, and eventually test the classifier on unseen images. Although it seems we are on the brink of solving image classification, there's still one fundamental flaw with all of these architectures, and it has to do with the process of image classification itself. What if we don't have images of every single category? In the example of classifying animals, what if we don't have enough labeled samples of squids or giraffes? What do we do then? Well, our image classifier certainly won't know. I'm Ajay Halthor, and in this video we're going to take a look at exactly how we can build, or rather synthesize, an image classifier without having samples of every object category. So let's get to it.

Let's restate our problem. We need to create an image classifier without having samples of some object categories. In the example of creating an animal classifier, we want to build it with very few, or maybe no, giraffe samples. One way we could do this is zero-shot learning, that is, learning from zero examples. Instead of training the classifier with an image, we can input a description of that object. This description can be text, word vectors, or any other input type; such input types are called modalities. We are now able to train our classifier by describing the objects for which we don't have labeled images. So does that solve our problem? Not quite. How do we know which features to define, and what to call them? There could be naming ambiguity when many people contribute descriptions. So instead of describing an object category, what if we were to draw it? That would remove the naming ambiguity, and they do say a picture is worth a thousand words.

The paper Sketch a Classifier, released by researchers at the University of London, does exactly that. We are going to discuss three types of models that leverage drawings, or sketches, to synthesize image classifiers. The first model converts a sketch classifier into a photo classifier. The second model uses a sketch, or a few sketches, to synthesize a photo classifier. And the third model uses some sketches plus a photo classifier to synthesize another, more fine-grained photo classifier. The models that do the magic are called model regression networks, or MRNs. Notice that the goal of an MRN is to generate an image classifier.

Let's take a look at the three ways of doing this using sketches, with some math. Let the MRN be a parametric model F parameterized by big theta. We also let the input sketch classifier be a parametric model little f parameterized by theta s, and the output image classifier be a parametric model little g parameterized by theta p. They all have their own parameters that need to be learned. Parametric models make life easier because the problem of determining a model is reduced to the problem of determining its parameters. We train a sketch classifier to get the parameters theta s. This is input to our MRN, which is trained to produce the parameters of the photo classifier, theta p. Since the output of the MRN is an approximation, I put a hat on it: theta p hat. And since we have theta p hat and the photo classifier is parametric, we have essentially synthesized a photo classifier.
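To make that first type concrete, here is a minimal PyTorch-style sketch, not the authors' code, of a model regression network that maps the flattened parameters of a sketch classifier to the parameters of a photo classifier. The class name SketchToPhotoMRN, the helper synthesize_photo_classifier, and the layer sizes are all hypothetical, just to illustrate the idea of regressing one classifier's weights from another's.

```python
import torch
import torch.nn as nn

class SketchToPhotoMRN(nn.Module):
    """Hypothetical model regression network: maps the parameters of a
    sketch classifier (flattened into one vector, theta_s) to the
    parameters of a photo classifier (theta_p_hat). A simple multi-layer
    perceptron, as used for the binary-classifier case."""
    def __init__(self, param_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(param_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, param_dim),  # predicted photo-classifier parameters
        )

    def forward(self, theta_s: torch.Tensor) -> torch.Tensor:
        return self.net(theta_s)  # theta_p_hat


def synthesize_photo_classifier(mrn: SketchToPhotoMRN,
                                theta_s: torch.Tensor,
                                feat_dim: int) -> nn.Linear:
    """Turn the predicted parameter vector into an actual binary linear
    photo classifier that can be run on image features."""
    theta_p_hat = mrn(theta_s)
    clf = nn.Linear(feat_dim, 1)
    with torch.no_grad():
        clf.weight.copy_(theta_p_hat[:feat_dim].view(1, feat_dim))
        clf.bias.copy_(theta_p_hat[feat_dim:feat_dim + 1])
    return clf
```

The second and third types described next would only change what the MRN takes as input: sketch features from the extractor phi, or a coarse photo classifier's parameters together with sketch features.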
Now, instead of an entire sketch classifier, what if the input to the MRN is just a sketch or a few sketches? Let's represent a sketch by sigma and extract its features with a feature extractor phi. We feed k such sketches to the MRN and train it to generate a photo classifier. The third type of MRN takes a trained photo classifier h, parameterized by theta p h, and a sketch sigma, and synthesizes a more fine-grained classifier. You're probably thinking: why use a photo classifier to generate another photo classifier? Well, the input classifier is more generic. It could be a bird classifier, for example, but when fed to the MRN along with the sketch of, say, a swan, it becomes a swan classifier. That's pretty neat, right?

I've talked about the MRN and its three types, but what exactly is the model regression network? To synthesize a binary classifier, the MRN is a multi-layer perceptron, and to synthesize a multi-class classifier, the MRN is a six-layer fully convolutional network.

Let's now determine the objective function. Remember, the output of an MRN is an image classifier, and that classifier is defined by its parameters. If theta p are the parameters of the original classifier and theta p hat are the parameters predicted by the MRN, then the loss is a simple L2 norm of the difference between them. But intuitively there is a problem with this: the difference between the parameters doesn't necessarily reflect the distance between the models themselves. A small difference in weights could lead to drastic changes in the results. To solve this, instead of only comparing the parameters of the synthesized model and the actual photo model, let's also compare their results, that is, their performance. We can measure this with a performance loss. Let y be the ground truth and y hat be the output predicted by the MRN-synthesized image classifier. For multi-class classification, y is a one-hot encoded vector and y hat is a vector of probabilities, and the loss is computed with good old cross entropy. The overall loss is thus a weighted sum of these two losses, the regression loss and the performance loss, and we need to find the parameters big theta that minimize it. This paper uses an alpha equal to 0.01 and a beta equal to 1, and the loss is minimized using the Adam optimizer.

Training uses 75,000 sketches from the Sketchy dataset along with 12,500 photos over 125 categories. They also use 56,000 photos from ImageNet that match the categories in Sketchy. Training the MRN with sketch models synthesizes photo classifiers with 78-80% performance on multi-class classification. Interestingly, simple standalone sketches fed to the MRNs yield even better photo classifiers, with a performance of about 83%. Now that is cool.

So what have we learned today? Do photo classifiers perform better when you have all the data you need? Well yes, yes they do. But it's usually the case that we simply don't have an abundance of annotated data, and this research shows us a method to overcome that drawback. We design model regression networks, MRNs, that are used to synthesize photo classifiers. MRNs come in three types depending on their inputs: an input sketch classifier; an input sketch or a few sketches; or an input photo classifier plus a sketch, to create a more fine-grained photo classifier.
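As a rough illustration of that combined objective, here is a minimal PyTorch-style sketch, not the paper's code, of how the MRN training loss could be assembled. The function name mrn_loss, the way the synthesized classifier is applied to photo features, and which of alpha and beta pairs with which term are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def mrn_loss(theta_p_hat: torch.Tensor,   # parameters predicted by the MRN
             theta_p: torch.Tensor,        # parameters of the real photo classifier
             feats: torch.Tensor,          # photo features, shape (batch, feat_dim)
             labels: torch.Tensor,         # ground-truth class indices, shape (batch,)
             alpha: float = 0.01,
             beta: float = 1.0) -> torch.Tensor:
    # Regression loss: L2 distance between predicted and real parameters.
    reg_loss = torch.sum((theta_p_hat - theta_p) ** 2)

    # Performance loss: run the synthesized (linear) classifier on photo
    # features and compare its predictions to the ground truth with cross
    # entropy. theta_p_hat is reshaped into a (num_classes, feat_dim)
    # weight matrix; the bias is omitted here for brevity.
    num_classes = theta_p_hat.numel() // feats.shape[1]
    weight = theta_p_hat.view(num_classes, feats.shape[1])
    logits = feats @ weight.t()
    perf_loss = F.cross_entropy(logits, labels)

    # Weighted sum; the video gives alpha = 0.01 and beta = 1
    # (which weight goes with which term is an assumption here).
    return alpha * reg_loss + beta * perf_loss
```

Minimizing a loss like this with Adam over many training pairs of sketch inputs and known photo classifiers is what fits the MRN's own parameters, big theta.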
And that's all I have for you for now. If you liked what you saw, hit that like button. If you like content like this, on AI, machine learning, deep learning, and data science, then hit that subscribe button and ring that bell icon for instant notifications when I upload. Links to the main paper and other resources are down in the description below, so check them out. Still haven't gotten your daily dose of AI? Then click or tap one of the videos right here for an awesome video, and I will see you in the next one. See ya.