One of the features of deep learning that I find fascinating is that neural nets use continuous approximations, differentiable functions you can do gradient descent on, to approximate a world which is discrete, which has separate objects and separate facts in it. In some sense they are magic. It's amazing that they actually work. You can take a neural net and it can approximate a piece of C code, or natural language, where you have sets of sentences, you have words which are discrete, you have relationships between them. And note that they're doing something slightly implausible. Discrete optimization is NP-hard. If you're trying to do something where you say "this and this but not that," then finding the optimum for most of these discrete problems is really hard. Yet neural nets magically do gradient descent. They just go smoothly down, down, down and very efficiently find approximations to these really hard discrete problems. Learn to invert a list (we'll see a tiny sketch of exactly this in a moment). Learn to recognize that this is here, and that that is not there only if this is there. Logical sorts of pieces.

And it's important to note that because you're approximating a hard problem, the approximation is guaranteed to be imperfect, and this is going to drive a lot of future research. You can get 90%, 95% good approximations with this continuous approximation, this nice differentiable smooth function, but it is not going to be able to solve everything. There are problems that are provably too hard to be solved by these sorts of methods, at least in any reasonable amount of time. So the big research question: can we somehow glue in discrete representations, symbolic representations of the world?

Now, maybe this is not necessary. Inside your head, everything is continuous. Well, actually, underneath the continuous representation is a discrete representation: neurons firing, boom, boom, boom. But we think of neural activity as continuous. We don't actually have objects like houses and screwdrivers stored in our heads in a place you can latch on to, but somehow we behave as if we could recognize things like screwdrivers. So how does that work? How could we do that?

Let's look at one attempt to do this: the CLEVRER system. It's a system that generates synthetic movies in which little cubes and spheres bounce around, knocking off each other. You then show a movie to the computer, so it gets a sequence of images, and you ask it questions. Descriptive questions, like how many spheres are moving. Explanatory questions, like which of these things is responsible for the gray object being knocked. Prediction, which is what most neural nets are asked: where will the blue cube go? And things that are hard for current neural nets to answer, like counterfactuals: what will happen if the gray sphere is plucked out and removed? To answer questions like these, it's hard to work purely with a standard CNN or LSTM sort of model.

The idea these researchers are proposing is to use neuro-symbolic reasoning. This is one of many systems that do some version of the following: map from an image to a discrete representation of objects. In these pixels there are red cubes and green cubes, objects that have attributes: they have shapes and sizes and colors. They also have relations, something very symbolic: how many objects are to the right of the red object? They have visual cues: are they of the same material as this or that?
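Here is that list-inversion sketch: a minimal, self-contained illustration of the point above, that gradient descent on a smooth loss can recover a discrete object. The list length, learning rate, batch size, and step count are all arbitrary illustrative choices, not anything from the system discussed here.

```python
# Learn to invert (reverse) a list by gradient descent on a smooth loss.
# The weight matrix W converges toward the reversal permutation matrix,
# a discrete object, even though the optimization is purely continuous.
import numpy as np

rng = np.random.default_rng(0)
n = 8                                    # list length (illustrative choice)
W = rng.normal(scale=0.1, size=(n, n))   # continuous parameters

lr = 0.1
for step in range(2000):
    x = rng.normal(size=(32, n))          # random input lists
    y = x[:, ::-1]                        # target: the reversed lists
    pred = x @ W.T
    grad = 2 * (pred - y).T @ x / len(x)  # gradient of mean squared error
    W -= lr * grad

print(np.round(W, 2))        # approximately the anti-diagonal permutation matrix
test = np.arange(n, dtype=float)
print(test @ W.T)            # approximately [7, 6, 5, 4, 3, 2, 1, 0]
```

Nothing in the loss says "permutation"; the matrix just drifts toward one because that happens to be the optimum of a friendly smooth landscape. That is the magic described above, and also the catch: on genuinely NP-hard discrete problems the landscape is not this friendly, and the continuous approximation will fall short.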
So the idea is to merge a deep learning model that maps from pixels to a discrete representation of objects, attributes, and relations between them, where that representation is not built in, right? It's learned. And then allow that representation to answer questions. You can take the questions, use a standard transformer model to map those to facts, and somehow go from the continuous world of videos, through hidden, latent, discrete objects with attributes, to answering questions. Quite cool.

So how does it work? The computer watches a video. It's got some sort of mask-based convolutional neural net that parses the frames and recognizes objects, and an LSTM encoder; the LSTM learns to recognize what the objects are, what their colors are, what their shapes are. It then pulls that into a little model of the world, and here we're moving toward causality: there are discrete things that have dynamics, they move at certain speeds, they keep moving, they bump. That model then runs simulations that predict what will happen to these pieces. Given a question and a prediction, you feed both into a question-answering machine (we've seen versions of these in this course), and it answers. What is the shape of the second object to collide with the gray object? And it says it's a cube.

So note the pieces: parsing the text into objects and attributes, parsing the image into objects and attributes, modeling them, which is what is going on under the hood in a structural causal sense, and then answering questions. It's one example, very much a toy system, but it captures the spirit of recognizing that the world is composed of objects, attributes, and relations between them, while not being purely symbolic: there are LSTMs in the text analysis, and LSTMs or CNNs in the image analysis.
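To make the shape of that pipeline concrete, here is a minimal sketch in Python of its discrete half, assuming the perception stage has already turned pixels into a list of objects with attributes. Everything here (SceneObject, step, count_right_of, counterfactual) is hypothetical naming invented for illustration; the real system's dynamics model and question executor are learned components, not three-line functions. The point is only the division of labor: continuous nets produce the scene representation, and the questions are answered by symbolic operations over it.

```python
# A sketch of the symbolic half: a scene graph, a crude dynamics step,
# and question answering as operations over discrete objects.
from dataclasses import dataclass, replace as dc_replace

@dataclass(frozen=True)
class SceneObject:
    shape: str      # e.g. "sphere", "cube" (attributes from the frame parser)
    color: str
    x: float        # position
    y: float
    vx: float       # velocity (estimated across frames)
    vy: float

def step(scene, dt=0.1):
    """One tick of a crude dynamics model: things keep moving at their speed."""
    return [dc_replace(o, x=o.x + o.vx * dt, y=o.y + o.vy * dt) for o in scene]

def count_right_of(scene, color):
    """Relational query: how many objects are to the right of the <color> one?"""
    anchor = next(o for o in scene if o.color == color)
    return sum(1 for o in scene if o.x > anchor.x)

def count_moving(scene, shape):
    """Descriptive query: how many <shape>s are moving?"""
    return sum(1 for o in scene if o.shape == shape and (o.vx, o.vy) != (0.0, 0.0))

def counterfactual(scene, color, steps=10):
    """Counterfactual query: pluck out the <color> object and re-simulate."""
    edited = [o for o in scene if o.color != color]
    for _ in range(steps):
        edited = step(edited)
    return edited

scene = [
    SceneObject("sphere", "gray", 0.0, 0.0, 1.0, 0.0),
    SceneObject("cube",   "red",  2.0, 0.0, 0.0, 0.0),
    SceneObject("cube",   "blue", 3.0, 1.0, 0.0, 0.0),
]
print(count_moving(scene, "sphere"))   # -> 1
print(count_right_of(scene, "red"))    # -> 1 (the blue cube)
print(counterfactual(scene, "gray"))   # the world with the gray sphere removed
```

Notice why counterfactuals become easy here: once you have a discrete scene, you can edit it (remove the gray sphere) and re-run the simulation, which is exactly the kind of intervention a pure pixel-level CNN or LSTM has no handle for.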