So, we have seen this paper from Hinton. Actually we have two papers, one from Hinton's group and one that is anonymous, but fine. Apparently it was a bit complicated to understand, so I will try to explain briefly what the paper does, and then we'll see how we can implement it in PyTorch. So let's get started: capsules and routing techniques.

So, why do we need these capsules? At the moment, CNNs can deal with translation, but if you'd like to deal with other kinds of affine transformations, that may be an issue. Why is that? Think about a classic logistic neural network, where every non-linearity is a logistic function. In this kind of very classic network, you can think of every activation as the output of a specific feature detector. So you can think of a neural net as an ensemble of feature detectors, where every activation represents the probability of seeing that specific feature in the input. Does it make sense so far? All right, good.

So we can deal with translation because we use max pooling: if things are slightly moved around, we max-pool and we lose track of the exact position. Convolutional nets are therefore inherently able to deal with translation. But what about all the other transformations? What about rotation, stretching, and shear, all the other kinds of affine transformations? We can deal with those in two ways, and we are actually very aware of this. The first way is to have very deep and very wide networks. Why do these deep, deep networks work so well? Because they have plenty of feature detectors, and in this way they can account for all the affine transformations beyond just translation. So far, okay. What's the other way we can deal with the other affine transformations? What do you need when you do deep learning? Yes, a lot of data. That's our curse. We need an exponentially large amount of labeled data, which, okay, is not too bad because we can augment data, but still: we need a lot of detectors to account for these transformations and a lot of examples for the network to train its parameters. So these are two weak points of standard convolutional nets, which capsules will try to fix.

How can we fix them? We simply have to efficiently encode viewpoint-invariant knowledge. Big words, but we'd simply like to learn some representation that doesn't care about the specific instantiation of whatever we're looking at. You'd like to extract the underlying 3D object from a 2D view.

All right, so capsules. What are these fancy capsules? They are simply groups of neurons characterizing an entity in an image. Really, it's very easy: it's simply a bunch of neurons together, finished. You have a neural network? Just draw a circle around some neurons with a pencil. That's a capsule. Nothing hard. Don't even think about non-linearities: just a couple of activations, the linear output of a matrix times the vector from the previous layer. Now draw a circle around them: that's a capsule. Cool, right? Okay, let's go on. Why do we care? We'd like every capsule, which represents a specific entity, to be able to tell us something about the entity it is trying to represent.
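Just to make the "draw a circle around some neurons" picture concrete, here is a tiny PyTorch sketch; all the shapes are made up for illustration:

```python
import torch

# A capsule layer is just ordinary activations regrouped into small vectors:
# here, 256 plain activations per example become 32 capsules of 8 neurons each.
activations = torch.randn(16, 256)       # [batch, neurons]
capsules = activations.view(16, 32, 8)   # [batch, num_capsules, capsule_dim]
```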
For example, those properties may include pose, deformation, velocity, albedo, hue, and texture. So every capsule has a bunch of activations characterizing that specific entity through different properties. Lots of words, but nothing fancy here: it's just groups of neurons. What's their role? Why do we need this stuff? If we have these capsules trying to characterize a specific entity in the input scene, we can try to invert the rendering process. What the heck does inverting the rendering process mean? Let's see. Basically, from a 2D camera projection we'd like to infer the 3D abstract model that generated that specific view. Still fancy; I can tell from your faces that it sounds kind of funky, so I put a drawing here. Basically, from a picture of a teapot we'd like to reconstruct its geometrical representation. That's the intuition behind this stuff: we'd like to infer what the actual entity is, of which we see one example here.

All right, enough text. Let's see how this is done, because it's pretty straightforward as well. Every capsule can be represented by a vector. The orientation of this vector represents the properties of the specific entity we are looking at, and the length of the vector represents the probability of finding that specific entity in the input, or in the previous layers. Once more, we can think about these capsules as bubbles (I like to think about bubbles), each with a vector inside. The length of the vector is the probability of finding that entity, and the orientation characterizes the properties of the entity the capsule is trying to model. So if the capsule is trying to model something which is not present in our input image, what does it do? What do you think is going to happen? Zero. It shrinks down to zero, right? The length of the vector just disappears. And it actually does: I'll show you the examples later, and it's just ridiculously amazing how well it works.

So we said the norm represents a probability. How can we enforce it to be a probability? It has to be in the range of... Zero to one. Zero to one, right. Sorry, I usually teach undergrads. Basically we use a squashing function; that is our non-linearity here. The input to capsule j is the vector s_j, and the squashing, v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖), renorms it so that the norm is at most one and goes to zero for small inputs. The scaling factor, of course, doesn't change the direction; it only changes the magnitude of the vector.

All right, so let's see where this s comes from. First of all, we have this v_j, which is capsule j; it gets its input from s_j. How do we compute s_j? It is simply a weighted average of the prediction vectors û_j|i, which I'll explain soon: each of those vectors gets some weight. All of these bubbles are vectors, and the c_ij are scalars. Here the capsule is just v because a capsule's norm is at most one, right?
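As a minimal PyTorch sketch of that squashing non-linearity (the function name and the eps guard against division by zero are my own choices):

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    # v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||):
    # short vectors shrink toward zero, long vectors approach unit norm,
    # and the direction of s is preserved.
    squared_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (squared_norm / (1.0 + squared_norm)) * s / torch.sqrt(squared_norm + eps)
```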
So my v is the capsule, which gets its input from s, which is like a non-normalized capsule, and we normalize it to get the output. The non-normalized input of the capsule is simply a weighted average of other vectors which are not yet capsules: unnormalized capsules. The û's are unnormalized capsules. Do you see the arrows? Yeah. They are wider than the circles. I spent some time making those drawings. "But v_j is a capsule, though, right?" Aha, v_j is a capsule because you squash it. "So why is its arrow smaller than the others?" Because it's a probability, between 0 and 1. Correct. You just joined later; yes, that's correct. All right, so far? Good, right? Good. Questions? No questions. Okay.

So what the heck are those û's? The û's are the prediction vectors, which are generated from the capsules of the layer below. "And the u_i, are those capsules?" Yes, those are the capsules here; you can see the arrows are within the circles. Also, you can see the orientations of those vectors are different from the orange layer to the purple layer. That also took some time. You apply a linear transformation, of course, and you get some kind of rotation, scaling, and whatever, right? So, reiterating: we go from the bottom to the top of the architecture, because neural networks start from the bottom and go up. If you draw it the other way around, I'm going to bite you. Seriously, don't draw networks upside down.

"I have a question. The û's coming from one lower capsule, are those used for all the capsules in the next layer?" Which ones, the orange ones? The orange ones are the capsules of the layer below. "So those are used to generate predictions for all the other capsules?" Yeah, they are used for different capsules. But each purple û_j|i is only used for capsule j, for one j, because it's written j given i. Perfect.

All right. So from the bottom capsules, in the first layer, we apply a linear transformation, which is some scaling, rotation, some sort of operation. Then, from the prediction vectors, we apply routing, which is basically averaging them out using those coefficients c_ij, and we get the input to capsule j, which provides the capsule output after the squash. (A PyTorch sketch of the prediction vectors follows below.)

All right, next one. This is how we learn this stuff: the W parameters we learn by backprop. And what about the c's? The c's are fancy, so let's look at those c_ij; this is the dynamic routing part. Here I will name my green layer L+1 and the purple layer L, and I will call S_L the size of layer L. How do we get those c_ij? We just initialize some logits b_ij to zero for every capsule i in layer L and every capsule j in layer L+1. Then we take a softmax of those b's; given that they are all zeros, the c's are all 1 over the number of capsules in layer L+1. The softmax is taken over the j components, so if you take c_ij and sum over j, you get 1. "So each capsule commits to something like one of the next-layer capsules?" Yeah, it's important, because this is saying that each lower-layer capsule routes to one upper-layer capsule. That's what you want; you don't want it to go to multiple ones, and that's what this is enforcing. So it's the sum over j, the higher-level capsules, that equals 1. Then we compute the input to the capsule as the weighted average, which at this first step is simply an average; there is no real weighting yet, it's just an average of all the predictions.
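Before the loop itself, here is a minimal sketch of how the prediction vectors û_j|i can be computed in PyTorch. The shapes follow the MNIST architecture from the paper (1152 primary capsules of dimension 8, 10 digit capsules of dimension 16); the einsum layout and the initialization scale are my own choices:

```python
import torch

B = 16                                 # batch size (arbitrary)
in_caps, in_dim = 1152, 8              # PrimaryCaps: 32 x 6 x 6 capsules of size 8
out_caps, out_dim = 10, 16             # DigitCaps: one 16-D capsule per class

u = torch.randn(B, in_caps, in_dim)    # lower-layer capsule outputs u_i
# One transformation matrix W_ij per (lower i, upper j) pair; learned by backprop.
W = 0.01 * torch.randn(in_caps, out_caps, out_dim, in_dim)

# Prediction vectors: u_hat_{j|i} = W_ij u_i
u_hat = torch.einsum('bik,ijdk->bijd', u, W)   # [B, in_caps, out_caps, out_dim]
```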
Then we apply the squashing to get the output of every capsule in layer L+1. We perform these operations, the weighted average and the squashing, for every capsule in layer L+1. At the end, we update our logits with the dot product between each prediction vector and the output of its capsule, so b_ij gets û_j|i · v_j added to it. Finally, we just repeat a couple of times; after three or four iterations, this update doesn't show any strong variation any more. (The full loop is sketched in PyTorch below.)

I think it's important to talk about the motivation, about what this softmax is doing. The softmax is used to strengthen just the path from one capsule in the layer below to its own parent: the capsule decides where to route itself through the network. In practice, when you run this multiple times, for each i the c_ij turns out to be very close to one for one j and close to zero for the others, which effectively routes each lower-layer capsule to one upper-layer capsule. These c_ij are learning to do this dynamic routing. It's also worth saying that what this thing is doing is some sort of voting scheme: all the lower-layer capsules vote for something, and v_j is the aggregate of those votes. The capsule prediction that has the most overlap with the aggregate vote gets a large coefficient, so the c for that one becomes large, while the ones that don't have much overlap with the aggregate of the votes, which is v_j, get shrunk. So v_j is the output of the capsule.

"So v here is indexed by just one index that changes. On your next slide, suddenly v is indexed by two things, like k and i." Right, that's for training. For example i, I have my input x_i: I'm providing one example, so I have an input and I have a label. If we are at the index corresponding to the correct label, we get the first row of this loss; if we are at an index not corresponding to the label, we get the second one. So we have one of the first kind and nine of the other. This loss tries to boost the norm of the correct capsule up to at least 0.9, if the positive margin is 0.9, and it tries to squash the capsules of the wrong labels down below 0.1, if the negative margin is 0.1. If you are above 0.9 on the correct capsule, the first part goes to zero; if you are below 0.1 on the wrong capsules' indices, you are also at zero. So the loss function is positive only if you have a wrong margin, a wrong probability. (In the paper the negative term is additionally down-weighted by a factor of 0.5; see the sketch below.)

"Are there as many output capsules as classes?" Yes, there are as many output capsules as the number of classes, and each capsule's norm represents the probability of observing that class. Moreover, if you look at the properties of that capsule, the orientation, you are able to reconstruct the input. The final margin loss is simply the summation over all the output capsules. In the case where the correct capsule is above 0.9 and the wrong capsules are below 0.1, the summation is just zero, okay? Does it make sense? All right.
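Putting the routing steps from a moment ago together, here is a minimal PyTorch sketch of the routing-by-agreement loop. It reuses the squash function and the u_hat tensor from the earlier sketches; the three-iteration default matches what the paper reports, and the function name is my own:

```python
import torch
import torch.nn.functional as F

def dynamic_routing(u_hat, num_iters=3):
    # u_hat: [batch, in_caps, out_caps, out_dim] prediction vectors u_hat_{j|i}
    B, in_caps, out_caps, _ = u_hat.shape
    b = torch.zeros(B, in_caps, out_caps, device=u_hat.device)  # logits b_ij = 0
    for _ in range(num_iters):
        c = F.softmax(b, dim=2)                    # c_ij: softmax over upper capsules j
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)   # s_j = sum_i c_ij * u_hat_{j|i}
        v = squash(s)                              # v_j: [batch, out_caps, out_dim]
        b = b + (u_hat * v.unsqueeze(1)).sum(-1)   # agreement: b_ij += u_hat_{j|i} . v_j
    return v
```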
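And here is a matching sketch of the margin loss just described, with the paper's margins m+ = 0.9 and m− = 0.1 and its down-weighting of 0.5 on the negative term:

```python
import torch
import torch.nn.functional as F

def margin_loss(v, labels, m_pos=0.9, m_neg=0.1, lam=0.5):
    # v: [batch, num_classes, out_dim] output capsules; labels: [batch] class indices
    lengths = v.norm(dim=-1)                              # ||v_k||, the class probabilities
    T = F.one_hot(labels, num_classes=v.size(1)).float()  # T_k = 1 only for the true class
    L = (T * F.relu(m_pos - lengths) ** 2                 # push correct capsule above 0.9
         + lam * (1 - T) * F.relu(lengths - m_neg) ** 2)  # push wrong capsules below 0.1
    return L.sum(dim=1).mean()                            # sum over classes, mean over batch
```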
So let's see the architecture. We provide an input x to the network. We have a convolutional block, which is simply massaging the input, and then we have two layers of capsules: a primary capsule layer, which is a convolutional capsule layer, and the digit caps layer, which works with matrices, not with convolutional parts. So the central part uses convolutional kernels; the right one simply has matrices.

Moreover, in order to enforce that each output capsule is able to represent the input completely, there is also a reconstruction loss, which is added to the total loss. The 10 output capsules are all masked but one, the correct one, and from the correct capsule we go through a three-layer MLP and try to reconstruct the input image. This is the loss he's talking about. So the final loss of the whole system is a summation of the margin loss plus a fraction of this reconstruction loss. The weight here is very, very tiny, just 0.0005 or something like that. (I'll sketch this at the end.)

All right. These are the results of the paper so far. The first row is the input to the network, and below it is the output of the reconstruction. The three numbers on top are, respectively, label, prediction, and reconstruction target. For example, in the second column the input is the number five, and the reconstruction actually acts as a regularizer and removes the extra stroke. Moreover, the tail of the five, which is curly in the top row, the input, is completely smooth in the output. In the third column, the gap at the top of the eight is closed, and there is also a kind of glitch here which is recovered.

Finally, let's see what happens when a wrong prediction is made. For example, in the second-to-last column we input a five that has a small segment on top, and it is interpreted as a three instead. If the network is forced to reconstruct a five, it will actually reconstruct a five. On the other side, with the same input but with the reconstructed digit being a three, the network removes the top-right segment in order to make it look like a three.

Here are some more of those results. The ones that I like the most are the following. In this one, we see that one dimension of those output capsules can parameterize the thickness of the digits. And another dimension, for example, can characterize a specific style of writing a number.

Finally, this is another experiment they have done. They input two digits, one on top of the other, so there are two digits in the same image, overlaid. The top rows are the overlaid inputs, and the bottom rows, in both cases, are the reconstructed inputs. For example, top left, we have a two and a seven in the original image; the network correctly classifies two and seven, and it generates both, even though the seven has a segment that overlaps with the two. Same for, let's say, the third column: we have a six and an eight, which I really have difficulty telling apart, and they are both correctly classified, and the reconstruction shows that even though they are overlaid, the network can nicely draw those two symbols. "They cheat a little bit in training this, because the final reconstruction loss uses the separate images." Yeah, I know. But still, it looks amazing to me that although they are overlapped, it is still able to figure them out. I find it quite surprising.
Because it can draw those overlapped pixels and actually assign them to both classes. And in the next case, which I'm going to show you very soon, if those pixels are not present, the network will refuse to generate the output. Let me show you. Here instead are some misclassifications. In the first one, we have a two and an eight, but the network predicts two and seven. And here you can see: if you ask it to reconstruct two and eight, it will reconstruct two and eight, and if you ask it to reconstruct two and seven, it will also reconstruct two and seven. But the part that I liked is this one: if you actually input a five and a zero and you ask it to reconstruct a five and a seven, the network will complain; it will not reconstruct the seven. You see that kind of cloudy region there. The input is the mixed image, the image with the two digits, and the reconstruction is done with the separated digits. So you train with the margin loss, where two of the capsules are correct and the others are wrong, and then the reconstruction is done with the separate digits. "That does seem more of a fair point about the generative model. Basically you're training a generative model, so it has to come up with some representation of the thing that's going to map to twos or whatever. So in that sense it's just kind of pattern-matching twos, whatever the mix of things is." But still, if you don't have those features, given that in this input the seven has never been seen, the capsule representing the seven will have a very small norm, and therefore it's going to show you a very faded result.
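To tie the implementation discussion together, here is a minimal sketch of the reconstruction decoder and the total loss for the MNIST setup. The decoder sizes (16 → 512 → 1024 → 784) and the 0.0005 weighting are the ones from the paper; margin_loss is the function from the earlier sketch, and the masking-by-indexing style is my own choice:

```python
import torch
import torch.nn as nn

# Three-layer MLP decoder: from the 16-D capsule of the true class back to 28x28 pixels.
decoder = nn.Sequential(
    nn.Linear(16, 512), nn.ReLU(inplace=True),
    nn.Linear(512, 1024), nn.ReLU(inplace=True),
    nn.Linear(1024, 784), nn.Sigmoid(),
)

def total_loss(v, labels, images, recon_weight=0.0005):
    # Mask all output capsules except the one for the true label,
    # reconstruct the input from that capsule alone, and add the
    # (tiny) sum-of-squared-errors reconstruction term to the margin loss.
    correct = v[torch.arange(v.size(0)), labels]             # [batch, 16]
    recon = decoder(correct)                                 # [batch, 784]
    recon_err = ((recon - images.view(images.size(0), -1)) ** 2).sum(dim=1).mean()
    return margin_loss(v, labels) + recon_weight * recon_err
```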