Look at these faces. Now imagine we have 10,000 Sams and 10,000 Lisbeths. Who is this? If you have half a brain, you guessed right. Let's see how the model does here. "Hey, this is Lisbeth. This is Sam. This is also Sam. Easy." You did pretty well, model. Good job. But really, I couldn't get my friends to take 10,000 selfies; it's more like three to four selfies. Now look at this face. Who is this? Again, if you have half a brain, you guessed right. Let's see how the model did here, though. "All right, this is Sam. Oh... this is someone outside the dataset. Oh... all right, this is Sam for sure, lock it in." Okay, I'm out. This is done. This is so dumb. Am I done? Yes, you are. Clearly the model didn't do that well. But what if I only had one example for each of them? You know the drill: who is this? Huh, you didn't expect that, but it doesn't matter. You just don't know who it is. You just don't recognize the face, and that's the correct answer. I'm going to save the model from embarrassment and just let you know that it didn't really perform that well. There are several key takeaways from these three exercises. Humans learn fast: we only need to see a face once, or maybe a couple of times, to know that we either recognize it or we don't. This is the idea behind one-shot learning and few-shot learning. The second takeaway is that machines need a lot of data; sometimes even two faces with 10,000 training examples each isn't enough. The reason for such a discrepancy is prior knowledge. When I showed you the three cases, we went in there with knowledge of the world: you know how humans look, you know how to tell people apart, you know how different people can look, you know how to identify humans wearing different pieces of clothing, and so on. But a machine goes into this with no prior knowledge; it knows nothing. If we do give some prior knowledge to our systems, how do you think they would fare?
Well, it depends on that prior knowledge. Prior knowledge can come in many forms, but I'll stick to three types based on a blog post, just to be consistent: prior knowledge about similarity, prior knowledge about learning, and prior knowledge about data. Humans are so knowledgeable that we can make use of all three when presented with a new task, but in this video we're going to focus only on the first, that is, prior knowledge about similarity. We can tackle the others in a separate video. But before we continue, this video is partially sponsored by Kite. They provide a code completion service for machine learning code. It integrates super well with your editors and even Jupyter notebooks, so click the link in the description to try Kite for free. Now, back to the video. So here we are, looking to train a model to recognize faces. It takes an image and returns a person's name, probably as the result of a softmax operation. But what role would prior knowledge about similarity play here?
The way we've defined the model, there are two major disadvantages. The first is that we need way too many examples to learn the parameters of the model. The second is that if we need the model to recognize a new face, we'd have to add a node to the softmax layer to represent the new class, and since we've modified the network architecture, we'd need to train the network again. To solve both of these disadvantages, we transform the problem the model is solving. Instead of providing an image and telling the network to figure out who it is, we give the network two images and tell it to determine whether the images are of the same person or of different people. Into this model we can then inject prior knowledge while training. I'll definitely get into this prior shortly, but the point I'm trying to make here is to show you how this new setup overcomes both disadvantages of the traditional approach. The model takes in pairs of images from your dataset, and it's really easy to create these pairs from a small set of images by combining them every which way, so we don't need as many training images. Also, let's say we add a new face that the model has never seen before: we no longer need to modify the network architecture, because it's still just a binary classification problem. Great. So now that we know this new approach can decrease training data and is scalable, let's get into the details of the model and this prior. The model isn't just a single network. It's a pair of identical networks that converge into a node computing a similarity function, which terminates in a sigmoid neuron that outputs whether the faces are similar or different. Since we're dealing with a pair of identical networks, this setup is called a Siamese network, as in Siamese twins. In this case the networks are convolutional neural networks, since we're dealing with image inputs. The outputs of these networks aren't a typical softmax, though. They are embeddings.
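To see why pairs stretch a small dataset so far, here's a minimal sketch of building same-person and different-person pairs from a handful of photos. The `photos` dictionary, filenames, and the `make_pairs` helper are all made up for illustration; in practice you'd pair actual image tensors.

```python
from itertools import combinations, product

# Hypothetical tiny dataset: a few selfies per person (filenames stand in for images).
photos = {
    "sam":     ["sam_1.jpg", "sam_2.jpg", "sam_3.jpg"],
    "lisbeth": ["lisbeth_1.jpg", "lisbeth_2.jpg", "lisbeth_3.jpg"],
}

def make_pairs(photos):
    """Build (image_a, image_b, label) pairs: label 1 = same person, 0 = different."""
    # Same-person pairs: every combination of two photos of one person.
    positive = [
        (a, b, 1)
        for imgs in photos.values()
        for a, b in combinations(imgs, 2)
    ]
    # Different-person pairs: every photo of one person against every photo of another.
    negative = [
        (a, b, 0)
        for p, q in combinations(list(photos), 2)
        for a, b in product(photos[p], photos[q])
    ]
    return positive + negative

pairs = make_pairs(photos)
print(len(pairs))  # 15 training pairs from only 6 photos
```

Six photos already yield fifteen labeled pairs, and the count grows quadratically as you add photos, which is exactly why this reformulation needs far fewer raw images.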
Embeddings are basically vectors that form a compressed representation of the image. You'd implement this by taking a typical neural network for classification and removing the last softmax layer. Let's say the embedding output of each network is a 64-dimensional vector; each of these vectors represents the input image. Now we pass the pair into the similarity function, which computes the squared difference between the vectors. This number will be small if they are similar and large if they are different. We can convert it to a probability value by squishing it through a sigmoid function: if the output is above a certain threshold, they're similar; if below that threshold, they're different. This function that determines the similarity isn't learned by the network. It is prior knowledge: the network knows how to determine whether two embeddings are similar even before looking at its first image, hence "prior" knowledge. Let's take a really quick look at the math, at what's being predicted and what the loss is. So let's start with the images.
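As a concrete sketch of that similarity function: sum the squared differences between the two embeddings, then squash through a sigmoid. The embedding values, the weight `w`, and the bias `b` below are illustrative placeholders (in the real setup `w` and `b` are learned); note that `w` comes out negative, so a larger distance maps to a lower "same person" probability.

```python
import math

def embed_distance(emb_a, emb_b):
    # Componentwise squared difference, summed into one number.
    return sum((x - y) ** 2 for x, y in zip(emb_a, emb_b))

def similarity(emb_a, emb_b, w=-1.0, b=1.0):
    # Sigmoid of w*d + b. A negative w means more distance -> lower probability.
    d = embed_distance(emb_a, emb_b)
    return 1.0 / (1.0 + math.exp(-(w * d + b)))

# Toy 3-dimensional embeddings (the video uses 64 dimensions).
same = similarity([0.1, 0.9, 0.3], [0.12, 0.88, 0.31])  # nearly identical vectors
diff = similarity([0.1, 0.9, 0.3], [0.90, 0.10, 0.80])  # far-apart vectors
print(same > 0.5, diff > 0.5)  # True False
```

With a 0.5 threshold, the close pair is classified as the same person and the far pair as different people.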
Call them i and j. Each of the networks is the same convolutional neural network, represented by f, so the embeddings produced are f(i) and f(j), because remember, neural networks are just functions: you pass an image into the function and its output is the embedding. Hence they're written f(i) and f(j). These are 64-dimensional vectors. The similarity function takes these two vectors and computes their componentwise difference, which is another 64-dimensional vector. I'm squaring the terms here, but you could also just take absolute values. The similarity function then sums all the values in this vector to give us a single number, hence the sigma, where k iterates over every neuron, from 1 to 64, because there are 64 numbers, or 64 neurons. The network then passes this number through a sigmoid with an edge weight w and a bias b, so the final output looks like a typical sigmoid. y-hat here represents the probability of the faces being similar: if it's above a certain threshold, they're similar; if below, they're different. Now, this is great for inference, at test time, but the network learns by backpropagation, and the loss is the binary cross-entropy loss that usually goes with a sigmoid: y and y-hat should be close to each other, and the network tunes its parameters to minimize this loss. So how is this used in production?
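The prediction and loss above can be sketched end to end. The embeddings, `w`, and `b` are again made-up stand-ins (four dimensions instead of 64, fixed values instead of learned ones); the point is the shape of the computation, y_hat = sigmoid(w * sum_k (f(i)_k - f(j)_k)^2 + b), followed by the binary cross-entropy loss.

```python
import math

def forward(f_i, f_j, w, b):
    """Similarity head: d = sum over k of (f(i)_k - f(j)_k)^2, then sigmoid(w*d + b)."""
    d = sum((a - c) ** 2 for a, c in zip(f_i, f_j))
    return 1.0 / (1.0 + math.exp(-(w * d + b)))

def bce_loss(y, y_hat):
    # Binary cross-entropy: the quantity backpropagation minimizes.
    eps = 1e-12  # guard against log(0)
    return -(y * math.log(y_hat + eps) + (1 - y) * math.log(1 - y_hat + eps))

# Illustrative 4-dimensional embeddings for two photos of the same person (label y = 1).
f_i = [0.20, 0.70, 0.10, 0.50]
f_j = [0.25, 0.65, 0.15, 0.50]
y_hat = forward(f_i, f_j, w=-1.0, b=1.0)
print(y_hat, bce_loss(1, y_hat))  # high probability, hence a small loss
```

If the label were 0 instead, the same y_hat would produce a much larger loss, and the gradients would push w and b to score this pair as less similar.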
Honestly, a single example per person is pretty tough for a face recognizer, particularly since the input is just raw pixels and the network has to learn from a very small amount of data. It can take a few examples per person to get even a reasonable face recognizer. But for problems with more complex input features, few-shot learning or even one-shot learning becomes more feasible. There are a bunch of coding examples online that use a language dataset to detect similar characters in an alphabet, and I'll add those resources in the description below. I'm not going to explain the code here, since the blog posts themselves do the explanation quite a bit of justice, so check those out. Hope you all enjoyed this video, and until next time. Bye-bye.