It's good to think about the tricks, the ideas, that make our neural nets currently work as well as they do, and how we might extend them to make them work even better. Of course there's gradient descent, which is sort of magic and works great, but one of the key insights we've had over the last decade is how to build invariances into neural nets. And we've done it in a number of different ways. Convolutional neural nets exploit the fact that images are locally consistent, that there's translational invariance: the same feature detectors work across the whole image. The structure of visual space, where there's more local correlation than distant correlation, allows one to cheat by building in this prior, this inductive bias, this expectation, this invariance. And that lets us learn much more efficiently.

Similarly, we've used the same sort of idea when we've looked at recurrent neural nets or LSTMs: that there are local feature detectors, that there's local correlation structure across words or subword embeddings, that the world is close to Markovian, so you don't need to know as much about things that happened a long time ago. And in some sense there's translational invariance there too: if you look at one sentence early in a document and one sentence later, they're mostly pretty similar except for minor details.

People are also doing things we haven't done in this course: building more physics into the neural nets. Since at least the 80s, when I was doing it, people have built in facts about the world: energy is conserved, mass is conserved. These are invariances that you can build in as structure in a neural net, and again learn more efficiently and extrapolate better to things you haven't seen. We've also sort of implicitly used these invariances by pre-training.
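To make the weight-sharing point concrete, here is a minimal numpy sketch (not from the course itself) of why a convolution encodes the translational prior: the same kernel is applied at every position, so a shifted input produces a correspondingly shifted response.

```python
import numpy as np

def conv1d_valid(x, w):
    # Slide the SAME kernel w across every position of x.
    # This weight sharing is the built-in translational prior:
    # one feature detector is reused across the whole input.
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

# A simple "edge" detector applied to a signal and a shifted copy of it.
w = np.array([-1.0, 1.0])
x = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])
x_shift = np.roll(x, 1)  # same pattern, one step later

y = conv1d_valid(x, w)          # [0, 1, 0, -1, 0]
y_shift = conv1d_valid(x_shift, w)
# y_shift is just y moved over by one step: convolution is
# translation-equivariant, and pooling over the responses is what
# turns that into approximate translation invariance.
```

The same idea scales up to 2-D image filters; the point is only that one small set of weights is reused everywhere, so the network doesn't have to relearn the same detector at every location.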
Often we'll train up a CNN on one set of images and rely on those feature detectors working on other images. That's an invariance. Or we train NLP models on billions and billions of sentences and trust that the future sentences we see will share some of the same structure. So key to making almost all of our deep learning scalable is building in the right kind of invariance.

Now, one direction people are exploring, including Jeff Hinton, who is sort of the inventor of almost all of modern deep learning and convolutional neural nets, is capsules. I'm not going to cover this week's material in enough detail to code it up, but I want you to see where things are going. I'll warn you that every couple of years Hinton changes his mind about what a capsule is, but the idea is what I want to get across. And it's the idea of equivariance, not invariance. Invariance means things are exactly the same everywhere. Equivariance means that if you change, for example, the direction from which you're looking at something, there's a nonlinear change in the pixels, but there can be a linear change in what sub-objects you're looking at: an eye, a nose, a mouth, a car wheel. There's some fairly simple transformation under the hood.

The idea behind a capsule is to recognize one particular sub-object: some piece of a screwdriver, or the tip of a screwdriver, or a finger. And note that as you rotate it, within some limited range of angles, you should have some little module, like a filter in a CNN, that still recognizes that piece. You have a whole bunch of these capsules, each of which is estimating how likely it is that there's a finger here, how likely it is that there's a screwdriver head here or another piece there, and each of which outputs some information about it: what's its pose, its angle, what's the lighting, and maybe a canonical representation of it.
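As a cartoon of that "probability plus pose" output, here is a toy capsule in numpy. The names, shapes, and the split into one presence unit plus a pose vector are all illustrative assumptions for this sketch, not Hinton's actual architecture.

```python
import numpy as np

def capsule(x, W, b):
    """Toy 'capsule': map an input patch x to
    (probability the part is present, pose vector for that part).
    The design here is an illustrative assumption, not a published model."""
    s = W @ x + b                        # linear map into capsule space
    prob = 1.0 / (1.0 + np.exp(-s[0]))   # first unit: is the part there?
    pose = s[1:]                         # remaining units: pose/angle/lighting
    return prob, pose

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))   # 1 presence unit + 4 pose units, 8-dim patch
b = np.zeros(5)
patch = rng.normal(size=8)

prob, pose = capsule(patch, W, b)
# prob is a number in (0, 1); pose is a 4-dimensional vector the
# capsule can pass upward, instead of a single filter activation.
```

The contrast with a plain convolutional filter is that the filter emits one scalar, while the capsule emits a scalar plus structured information about how the part appears.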
So the idea is lots of little modules that are much smarter than a simple convolutional filter, but have some of that same flavor. And note the relationships too: eyes usually come in pairs, usually above the nose and below the hair, or the shiny spot on top. The question Hinton and others are exploring is how you modularize these things so that a module is sort of independent and captures enough information about its piece, but is richer and smarter than a filter. CNNs just apply a bunch of little filters and then take a max pooling: what's the maximum response I saw from each filter? Whereas capsules try to say, okay, here's a bunch of quasi-independent pieces of information; let's put them together in some more sophisticated fashion. They sort of work. They don't work entirely, but they're an exciting direction for building a more sophisticated class of invariance, or equivariance, into deep learning.
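That contrast between max pooling and a capsule-style combination can be sketched in a few lines. This is a cartoon of the difference, not any published routing algorithm: the part probabilities and 2-D poses below are made-up numbers for illustration.

```python
import numpy as np

# Three "part" detections: two confident, agreeing parts (eye, nose)
# and one low-probability clutter detection with a wild pose.
probs = np.array([0.9, 0.8, 0.1])
poses = np.array([[1.0, 0.1],
                  [1.1, 0.0],
                  [5.0, 3.0]])   # toy 2-d pose per part

# CNN-style max pooling: keep only the strongest response, and throw
# away all the pose information about where/how the parts appeared.
pooled = probs.max()

# Capsule-style readout: keep the poses and combine part predictions
# weighted by how confident each part detector is.
weighted_pose = (probs[:, None] * poses).sum(axis=0) / probs.sum()
# The two confident, agreeing parts dominate the combined pose, while
# the low-probability clutter detection contributes very little.
```

The point is only that pooling reduces everything to "was some filter active?", while the capsule-style combination preserves and aggregates the richer per-part information the text describes.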