All right, so welcome back to class. I'm kind of excited, scared, and a little crazy today because I decided to change the slides completely. This is still in progress, so I haven't finished, which means you're going to get an almost-ready lesson today. What happened is that this semester we taught EBMs in the third couple of weeks: the first couple of weeks were inference and training of networks, the second couple were parameter sharing, so recurrent and convolutional networks, and then in the third couple of weeks we talked about energy-based models, which Yann likes a lot. Usually no one understands anything, because we normally teach this later in the semester, when you are busy with projects and already done with the homework, so it's hard to pay attention to these complex, weird things. This semester we decided to change the order completely and taught EBMs last week and the week before. So now I'm supposed to teach the topics that I used to teach before EBMs, with the knowledge about EBMs that you actually have now. And I'm like, uh-oh, your perspective is different from the perspective of students from last semester, because you carry a different bag of knowledge, so everything I would have said last semester doesn't apply this semester. So I had to change my slides, which was painful because I had to redraw everything, and of course I didn't finish, but I tried my best. Not all animations will work; I'll fix this in the editing of the video, so whenever the video is up everything will be just wonderful, but today you get the almost-ready version. Wish me luck.

So what do we talk about today? Cool things. Let me actually start with today's lesson title: generative models. I usually spend quite a lot of time on the motivation; today I'll try to spend less and instead share some of the latest results, which came out just this year, so it's actually a very up-to-date lesson.

I usually start this lesson with these two pictures. My question is: which of the two images is the real one? If you think the real image is the one of the lady, click the green button. If you think the real image is the one of the dude, click the red button, so I can see what you think. Green, lady; red, the dude. Most of you say that the lady is real and the dude is not. And you can clearly tell that the dude is not real because the background is completely funky: it has the kind of artifacts you can notice here, which are clearly generated by a network. In fact, both images are fake. You can find more pictures on thispersondoesnotexist.com, or something like that; you can go online later and check these images. These are images generated by a neural network. This was mind-blowing when it came out. Now it seems reasonable, even trivial, but that has not been the case for a long time: we were not able to generate images. So today we are talking about generative models. What are generative models? Models that generate data that lives in this kind of input space: the space of images, for example, or whatever type of domain we are used to processing with neural nets. Also, you can lower your hands so there's no leftover here on the chat. Okay, thank you.
All right, so over the next three lessons we are going to learn how we can generate similar images, or even better ones. Once again, these images look photorealistic because the model has observed many different variations of faces. On the other hand, the thing that keeps changing over time is the background: the background is less defined because the model has not really seen a systematic, what was the word, stationary type of background. Faces are the recurring element; backgrounds are more varied, so the model just didn't catch those statistics of the background. All right, cool. That was the initial part of this talk; again, it used to be surprising. Next one. Oh yeah, this is from Karras et al., 2019.

All right, one more thing. In computer vision, interpolation used to mean taking part of one image and part of the other image and doing a linear interpolation. So what's going to happen if I do a linear interpolation of these two images? How am I going to fill up those missing squares here, the space in between? Can anyone guess the outcome of a linear interpolation of these two images, if you have some familiarity with computer vision? "Blurred," in the chat. I would not say blurred; you wouldn't get a blurred result, so that word is not correct. Any other suggestions? "Just layering the two images," yes, and "averaging the pixels," hold on. So what does layering the two images mean? You're going to get some sort of translucent effect. If I do 100% doggy and 0% bird, I get 100% dog. Then if I do 10% dog, and then someone raised a hand, what happened? I don't know. If you have 10% dog and 90% bird, you have some sort of translucent thing. So if I show you here, you get something like that. This is exactly what it means to do a linear interpolation in pixel space.

But if I asked you to come up with a chimera, something half dog, half bird, you wouldn't necessarily think about something that looks like this thing over here. Where is my laser pointer? This thing over here doesn't look like a chimera; it looks like an artifact. This is not what we imagine when we interpolate in our head. Why is that? Because in our mind we interpolate the concepts: I think about the bird, I think about the dog, and then, boom, I combine them, and I want to see a doggy bird or a birdie dog. So what can we do? If we do the interpolation not in pixel space but in the hidden representation of the network, and then decode that interpolation, then I may get something that looks like this. You get a dog, then less of a dog, a birdie dog halfway through, then it starts to look like a doggy bird and some weird stuff, then there's a chicken, and finally you get our bird. These also used to be surprising results; this was mind-blowing, that you actually managed to create a chimera. And you get this kind of weird stuff about chimeras. I always remind you that in Fullmetal Alchemist: Brotherhood there was a very memorable episode, I believe, where a chimera was made with a dog.
Very disturbing animal, but I loved it. Anyway, moving on. So these are part of what can be done in terms of generative models. I'm also going to show you very soon what an ex-undergraduate from NYU did last month, I think; it's incredible work.

Moreover, you can also have semantic changes, or specific variations of your generated images. In this case, we have all the possible interpolations between a manta ray, I believe, on the left-hand side and a dog on the right-hand side. In between it actually looks like a polipo, then it looks like a monkey. It's absolutely mesmerizing, I believe, how this interpolation from manta ray to dog actually goes through a, polipo, how do you call it in English? An octopus, right? Polipo is Italian, sorry. You go from the manta ray to the octopus and down to the dog. Again, lovely. Or you go from this furry doggy on the left-hand side to some sort of squirrel and then to a bird. Or you go from a skunk, I believe it's called, to a raccoon, to a dog, I don't know what breed that is. Or this one, where you convert a bird into a fly. Again, this stuff is mind-blowing to me.

Or, in this case, you actually condition. This is actually the first new word I'm using today: we have a conditional generative model here. You condition the generation of this image on this conditioning variable, which tells you what to change and in what direction. We don't just generate stuff; it's not just a generative model, it's a conditional generative model: you condition on specific transformations. So you have a zooming of the dog, some shifting of a daisy, some shifting in the other direction of a lemon, I believe. Or you can change the brightness of an image, and pay attention here: the brightness is not brightness in terms of luminance, computer-vision brightness; it's the brightness of the part of the day this stuff was recorded. The model has an understanding of how the images in this first row look across the whole day: in the morning, at noon, in the evening, at midnight, and so on. So whenever you change the brightness, you are actually changing the time of day at which you are generating this output. Again, to me this is mind-blowing; I don't know if it is to you as well, but okay, maybe it is. You can react to my idiotic things if you like. All right, cool, nice.

In the second case, you have a 2D rotation, so you're actually panning around, and the other one is a 3D rotation, so you're actually going around: in the second one you look around and in the other one you go around. Which I also believe is crazy, in the sense that the network is able to map, onto a 2D grid, these signals we've been talking about, from a possible internal 3D representation of reality. The model learns how to do a 2D projection of this internal 3D representation, which is again mind-blowing, but again, maybe I'm easily mind-blown.

All right. This next one is also not trivial. So far we've been talking mostly about supervised learning, where you have a sample X and a Y, which is a target, and you try to predict the target from the X. So you have an X, which would be different types of images.
You try to predict which bin they should belong to. So we've always been working, maybe, with vector-to-vector mappings. In this case here, instead, we are going to generate an anime representation given the photograph of a person. We don't have labels: unless you get an artist to draw these people, we don't have X-and-Y pairs. We only have examples of photos of ladies and, in this case, examples of drawings of female anime characters, but we don't have the correspondence. Nevertheless, this model somehow learns, we don't know yet how, to make the connection between the two different styles while preserving the content, the semantic content. So we preserve the content, the actual subject, while changing the style with which it's represented. It's like a translation: you have a language in input, like a natural language, or realistic photographs, and it gets converted into the same type of signal, still an image, but in a different type of language. So maybe this is different from what we have seen before: before, we were going from one domain, like signals, to labels or targets or whatever; now we are going from signals to signals, so we stay within the same type of domain. Whenever we generate stuff in this, let's call it input domain, these models are called generative models, because they generate stuff in this input domain, which I'm not going to call X, because we'll figure out very soon what is X, what is Y, and what is Z. We'll have some definitions; hopefully I didn't mess them up.

All right, two more examples and then we actually start with the lesson. Super-resolution. This is, how do you call it, a pictogram, a representation of what it means, but what it does is basically: you input a pixelated image and then you try to restore the high-frequency content of that image; you fill in the details, basically. So you provide this pixelated whale, no, dolphin, I think it's a dolphin, and then you get out this one. But again, this is just a pictorial representation of what this stuff does. The example I'm going to show you now is from 2009, eons ago, so it's not even using deep learning, but it's the introduction to this specific subject. On the left-hand side you provide a very low-resolution image; here it's been upsampled with a nearest-neighbor algorithm. The outcome we expect the model to produce is this very nicely refined version. It's rather simple because there are only straight or curved lines, and it's black and white; there aren't many crazy things going on. On the right-hand side you can see how the zebra got its stripes very well reconstructed, and now they look very neat and nice. So it looks like someone fixed the lower-resolution input. How do we do this? Oh, there are questions here: do we give the leftmost and the rightmost images to the model? Yeah, in the case of the anime we provide both distributions; somehow the model will learn both, and then it has to learn somehow how to associate one with the other. But again, we are going to be talking about this a couple of lessons from now.
I'm just giving you some candy, eye candy, so you get hungry and are ready to learn and absorb what I'm going to be teaching you soon. Hopefully, if this actually runs and works; again, some animations will be broken. The question was: how do we manage to reconstruct and fill in the missing information? In this case, you can see that this is the downsampled version of the image, our input. Here is the upsampled version with bilinear interpolation; this was the upsampled version using nearest neighbor; this is the original image; and here you have the reconstructed image. Oh, the other way around, sorry: this is the original image, and this is the reconstructed one. See, I cannot even tell them apart. Well, they changed the eye color: on the left-hand side I see blue or green eyes, on the right side brown or black eyes. And the skin tone is different: perhaps the face was in shade and the model thought it was a darker type of skin, whereas there was perhaps just a shadow.

What you notice here is that whenever I provide this image over here, the third row, to my model, given that this model has been observing mostly white dude faces, the Asian dude, instead of staying Asian, became European. This happens because of bias in the dataset: the model has been trained on a subset of all possible faces, and when asked to fill in the gaps, it's going to fill them in with the best of the knowledge it has acquired from the training dataset. Similarly, this lady has a side view, and there were not many side views in the dataset, so she doesn't quite get reconstructed appropriately, let's say. Or, similarly, I believe this dataset didn't contain many examples of faces with glasses, so in this case you can see that the reconstruction looks like he had an accident or something like that. The last one is also perhaps interesting: I believe the model may have changed the sex of this last reconstruction; it was a lady, and to me it now looks like a dude. Again, these are just based on what the model's training distribution was, and the model just fills in the gaps to the best of its capabilities.

One more here is called inpainting. In inpainting, we basically remove a portion of the image and then we task the model with reconstructing it, filling in the gaps. This is another possible application of generative models. Let's say you're shooting a movie and there are some people walking in the background: you can simply select the people and ask your model to fill in the gaps to the best of its capabilities, and it's going to make those cuts disappear. Or, like in this case, we have a gray rectangle over the mouth, and we ask the model to fill in this region to the best of its capabilities, and this is the outcome. Again, these are quite old results, from four or five years ago. And it used to be the case that VAEs performed slightly worse than generative adversarial networks.
I think now we can say they have caught up, especially with the discrete version of the variational autoencoder. We're going to talk about these two architectures in the next couple of lessons. So again, you're not supposed to know these things yet; I'm just telling you that this is when we saw the first results working pretty decently, around 2017 perhaps. All right, cool.

Okay, so here we go back to advertising, maybe, I don't know if it's advertising. I'm going to talk very quickly about the last type of generative example, which is caption-to-image. These are results from this year, 2021; again, I made these slides yesterday night. These are from Aditya Ramesh. Aditya was an undergraduate student here at NYU when I joined as a postdoc with Yann, and Aditya knew more than I did when I had a PhD, and he hadn't even finished his undergraduate degree at the time; he finished it, I think, a semester after we met. And you can tell how far he went: he's a scientist now at OpenAI. Again, to me this is mind-blowing. Let me show you.

In this case, the model is provided with a text description: "an armchair in the shape of an avocado," and then its duplicate, basically, "an armchair imitating an avocado." And this is what the model generates: an armchair in the shape of an avocado. I'm like, what is this magic? This is crazy. All right, then I change one word: instead of an armchair, I say "a clock in the shape of an avocado," and now you get these other images. Or, again, I change it to a lamp: "a lamp in the shape of an avocado." So we had a clock, an armchair, and now a lamp in the form of an avocado; this one says "form" whereas that one said "shape," and I don't know what the difference is, semantics I guess, I don't know English well enough to tell. But then how about I change the avocado? Instead of an avocado, let's use a pig: oh my God, this is a lamp in the form of a pig. Or a lamp in the form of a lotus root. This is, again, mind-blowing. And you can also see this stuff yourself on the website; DALL·E, yes. You're still with me, right? Yes.

Okay, this one is also crazy. You have "an illustration of a baby daikon radish in a tutu walking a dog," let me zoom in a little because I cannot see. So these are things generated by a network. Instead of a baby daikon, let's have a baby panda: oh, this is so cute. Or a baby penguin. And seriously, in a tutu; no, let's say in pyjamas; no, a wizard hat; oh, or with sunglasses. Okay, this one is walking a dog; no, let's change it to watching TV. Again, this looks like science fiction to me. This is crazy. How is this done? Relatively easily. Can I explain it to you? Not yet; you have to wait until two or three lessons from today. Will I explain it to you? Yes. Will you understand? Yes, I'm sure. It's not that difficult once we have all the building blocks. Is the text prompt over a limited set? No, it's just a sentence of up to 256 tokens, I believe.
So you can just ask whatever you want and this model will generate things. These are pixel-based: 256×256 images. Okay, so you should be hungry, crazy, I mean, I'm crazy, but we knew that already; you should at least, I hope, be excited to understand, and be attentive for the next three lessons, to figure out how this is even possible. And again, we'll build this up step by step. Hopefully I make the slides on time.

All right, so let's move on and start with today's lesson. Definitions. Why do we have definitions? Because otherwise we don't make sense. So what are X, Y, and Z? X is observed during both training and testing; X is always there. When you do classification, X is going to be your image; it's there every time. What is Y? Y is observed only during training; we try to predict Y during testing. Supervised learning means there was a human generating annotations: there has been someone, a person, creating these X-and-Y pairs, which are human-annotated. During testing we don't have Y; for example, in image classification we don't have the labels at inference time, we just try to predict the label given the X. Z, instead, is the latent variable. It's never observed, neither during training nor during testing. It's an input variable we have control over, and we may find it by, for example, minimizing the energy of the system.

So let's get a recap of how we were playing with these X, Y, and Z last lesson, last week. X doesn't have to be an image, and Y doesn't have to be a label. In the case I just showed you, my X is the string of text: in the lotus-root example, my X is the text prompt, "a lamp in the form of a lotus root," "a lamp imitating a lotus root." That's my X, which is a set, or actually I would say a sequence in this case, because it's a sequence of tokens making up this text. We talked about sequences a few lessons ago: we had curly brackets and then x[t], with square brackets, when we were talking about recurrent networks. So in this case my X is a sequence of discrete tokens. And my Y, in this case, is an image. Sometimes it's a target; it's not a label, but it's just my Y; perhaps, yeah, we can call it a target in this case, actually.

Last time we talked about this one here, no? It's grayed out because we already talked about it. This was how we were generating that horn-looking thingy: given an X, moving X, we were moving across this horn, and then, given the latent variable, the missing input, we could generate all the possible points around these ellipses. Yes, my screen is dim because we already covered this one. So here we had all of X, Y, and Z: this was a conditional, latent-variable EBM. And then we talked about this other one: the unconditional case, where we didn't have X. Y is still our target, but we no longer condition on X.
Or, in this case, X was a chosen one: I set X to be zero, so we don't change it, and we just try to model the Y distribution. That was pretty much it.

So how did we train that model? I'm just going back to the unconditional case. Unconditional means there is no X: we don't have anything during evaluation, we only have our Y's. The Y's we use during training, as we said, but during evaluation we don't have anything, and we don't have X's; that's why there are no X's here. So how do we train this model? To train this model, we take an observation Y, and we compute the free energy, basically trying to find the Z which gives us the Y-tilde that is closest to my Y. Then, given that we have found this free energy, which was the min or soft-min of the E, we minimize that loss functional such that the free energy is low for my observed sample. And we repeat, so many times, that an energy is well-behaved when it is low on good samples and high otherwise. There are two ways to achieve that. One is the contrastive method, which pushes down the energy on my good samples and pushes up the energy on the bad samples. Or, in the case we covered in class last week, we had an architectural method where we chose Z to be only one-dimensional, so it can only vary along one dimension while the Y covers two dimensions, and in this way we limited the amount of regions where the energy can be low, because it's just on a line; we curve the line, but it's still a line. So this was how we were training an energy-based model; I hope you remember. And now the screen is no longer dimmed, because this is just a training recap; we didn't see this slide before.

So we start now talking about something called, let's see. This is the diagram we know; so far it should be okay. Let me actually remove those things. This model has only a decoder: this model doesn't have a forward path, there is no input. Well, the energy-based model has an input, which is the Y, but not in the classical sense: there is no X provided to you, we only have the Y's, only the targets. Try not to get confused about this. So what happens to this Y? First, we have this H here appearing, and we have to compute H through some mechanism. Before, we were finding Z; how did we find the optimal Z? The optimal Z, the Z-check, was the Z minimizing the energy E, so we got Z-check by minimization of the E, and then we trained by minimizing that free energy. In this case, instead, we are going to use an encoder over here to come up with my intermediate code H, my hidden code. So in this case we use this encoder to perform something called amortized inference, thank you, Arthur. Here, instead of computing the minimization of the energy E with respect to Z, we have an encoder which approximates this minimization and provides me an H, a hidden representation, given my Y. So I have my Y, and I encode the Y. This is the first time we see an encoder: before, when we saw this other diagram, we had a predictor, so from X we predicted a hidden representation, which got decoded into the Y-tilde. In this case, there is no predictor.
In this case, there is an encoder, because we already have Y. So we encode Y into my H, my hidden representation, and then we decode H back into the Y-tilde. I hope that's clear. My equations are pretty much the same, so it's not a big deal. The hidden representation is my squashing function of the rotation of my input, the Y, where Y is the observation: h = f(W_h y). Then my Y-tilde is another squashing function g of the rotation of my hidden representation: ŷ = g(W_y h). In this case, Y and Y-tilde both belong to R^n, this input type of space, while the H lives in R^d, this internal representation. Again, W_h and W_y are the matrices rotating this stuff, so no big deal.

All right, so this was the autoencoder. The big deal is that we now have a module, this encoder, which performs this amortized inference, and we no longer have to minimize the energy E. Actually, if you look now, there's an F here, right? So here there are two examples of reconstruction energies. The first one is simply the squared Euclidean distance between my observation and my guess, the Y-tilde. And what is this Y-tilde? Y-tilde is the decoded version of the encoded version of Y: the encoded Y is H, I decode the H and get Y-tilde, then I compute the difference Y minus Y-tilde and take the squared norm. If I have binary input, then I simply compute, for example, the binary cross-entropy between each element of this Y and the Y-tilde. How do we train this stuff? Still the same way: we have a loss functional, which is, for example, the average across all training samples of a per-sample loss, and then I can simply take the energy loss, i.e., I try to push down the energy on this Y.

So, question for the people at home: what is the objective of an autoencoder? Okay, let me go back here. You should be able to answer this question before I tell you why. Why would we want to have an autoencoder? What is the reason someone would want to use an autoencoder? Can you guess? Okay, so people here in the chat are suggesting dimensionality reduction. Dimensionality reduction could possibly be one application, but that's definitely not what we are interested in here. I'm going to talk about that soon; again, this is what I used to think of whenever I thought about autoencoders, dimensionality reduction, but we don't actually care about autoencoders for that specific reason. Let me see another answer here; okay, many answers, let me enlarge this thing. "Get a lower-dimensional representation": no. I mean, I understand, that is dimensionality reduction, but no. Yes, you can, but that's maybe just one application; you are definitely correct in saying that, but I'm actually going somewhere else with this reasoning. "Reconstruct Y" is what we do during training. "Learn the latent," I guess hidden, code: that is correct. "Learn a lower-dimensional representation": not necessarily. We want to learn some code, some representation; it doesn't have to be low-dimensional, actually we may even want it to be high-dimensional. What? Okay, "remove noise": yes, removing noise is a good option. Okay, so forget about what you heard about autoencoders before, and now actually think about what you learned in the last class, I mean the last two weeks, with me and Yann.
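Just to make the equations above concrete, here is a minimal sketch, in PyTorch, of this autoencoder with the squared-distance reconstruction energy as the per-sample loss. The dimensions n and d, the tanh squashing, the learning rate, and the random stand-in batch are illustrative choices of mine, not values from the slides.

```python
import torch
from torch import nn

# A minimal sketch of the auto-encoder above (placeholder sizes, not from the slides):
#   encoder:  h = f(W_h y)   with f = tanh
#   decoder:  ŷ = g(W_y h)   with g = tanh
n, d = 784, 30                              # input dim n, hidden dim d
encoder = nn.Sequential(nn.Linear(n, d), nn.Tanh())
decoder = nn.Sequential(nn.Linear(d, n), nn.Tanh())

def free_energy(y):
    """F(y) = ||y − ŷ||², with ŷ = decoder(encoder(y)).
    Amortised inference: no minimisation over a latent z."""
    y_tilde = decoder(encoder(y))
    return ((y - y_tilde) ** 2).sum(dim=-1)

params = list(encoder.parameters()) + list(decoder.parameters())
optimiser = torch.optim.Adam(params, lr=1e-3)

y = torch.rand(16, n) * 2 - 1               # a stand-in batch of observations in [−1, 1]
loss = free_energy(y).mean()                # energy loss: push F down on observed samples
optimiser.zero_grad()
loss.backward()
optimiser.step()
# For binary inputs one would instead use a sigmoid output and binary cross-entropy.
```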
What are these energy-based models? What are they supposed to do? How is F supposed to behave? Tell me. No, no, there's no more minimization here; we defined what F is going to be, I showed you here, this stuff over here. But if you have an energy-based model... okay, there you go. Camilla, hold on, I'm reading Camilla's answer: a good energy should be low for good samples and high otherwise. Okay. So the reason we want to learn an autoencoder, one of the reasons, is to express, if you want (there's no ranking), how good a given Y is. And how do you do that? By learning the internal structure of the Y. So what does an autoencoder do? An autoencoder learns the structure of your input, in this case the Y, and encodes it in this H, the hidden internal representation, which is a code expressing your input, and which doesn't necessarily have to be smaller than the input.

Then why on Earth would we want to learn a representation larger than the input? This is getting crazy, I think. And so here we go: an under-complete and an over-complete hidden layer. The left-hand side is what some of you correctly pointed out a few minutes ago: one possible application of autoencoders is to get an H representation of my Y that is smaller in terms of dimensions, which basically performs dimensionality reduction using a non-linear transformation. If you think about PCA, that's simply a linear dimensionality-reduction technique; this is a non-linear dimensionality reduction, which may give you a better representation. In the other case, I would argue that it's actually better to have an intermediate representation that is larger than the input. But now there is a problem, right? The problem is: how on Earth am I supposed to be able to train this model and not have it collapse? So what is the collapse of an energy-based model? Tell me, if I've forgotten. Zero everywhere, fantastic, right. So if you have a hidden representation larger than the input, the model could simply copy this value here, copy this value here, copy this value here, and so on, and you get a perfect subtraction: zero. Every type of input gets zero, perfect reconstruction. Hallelujah, awesome, fantastic, we have an autoencoder that is able to reconstruct everything.

So my question now, which I believe you can actually answer: what is an autoencoder that can reconstruct everything good for? Answer me. If an autoencoder can reconstruct every possible input you present to it, what can you use this autoencoder for? Nothing, right? Let's have an autoencoder which is an identity matrix: you provide a vector and the thing gives you back the same vector; any vector you input, you get the same vector in output. Can you use the identity matrix to do anything? No, exactly. So an energy-based model which has collapsed, which is flat, or an autoencoder that can reconstruct everything, which is exactly the same thing, I just repeated myself, is useless. An autoencoder is good only as long as it can reconstruct the samples that have been observed during training.
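To make the collapse argument concrete, here is a tiny sketch, under my own assumptions (hypothetical dimensions, squashing functions dropped), showing that an over-complete linear encoder/decoder pair that just copies the input reaches zero reconstruction energy on every possible input, seen or unseen.

```python
import torch

# Collapse illustration (hypothetical sizes, squashing functions dropped):
# with d ≥ n the hidden code can simply carry a copy of the input.
n, d = 4, 8
W_h = torch.zeros(d, n); W_h[:n, :] = torch.eye(n)   # encoder copies y into h
W_y = torch.zeros(n, d); W_y[:, :n] = torch.eye(n)   # decoder copies it back out

y = torch.randn(n)                   # an arbitrary, never-seen "sample"
y_tilde = W_y @ (W_h @ y)            # encode, then decode
print(((y - y_tilde) ** 2).sum())    # tensor(0.) → zero energy everywhere: useless
```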
Or, put the other way, an energy-based model is good only if it's not zero everywhere, because otherwise it's flat. What could it be called, a flat, what's it called in English? A plain? No, the one you find in Africa, where you have mountains and then you have the grassland? Prairie, yeah, there you go. All right, so no flat surface, no prairie, no valley, just complete grassland; you can't tell anything apart. You want to have mountains, otherwise it's boring. Is it possible to avoid collapse? Yes, thank you for asking, of course. So, on the right-hand side, how are we going to be able to use the right-hand-side technique? There are many ways: you have to find a way to constrain the amount of regions that take zero or low energy value; that's what we've been learning so far. So we need to introduce some regularization, and we get regularized autoencoders. One example is sparsity: we can introduce a sparsity constraint over the hidden code, such that the hidden code has only a few units that are not zero; so zero, zero, zero, poof, zero, zero, poof, poof, zero, zero, zero, zero. We may add additional noise, and I'm going to talk about that very soon. And then we may also use sampling to help us.

All right, so next, oh, I cannot even go over time because I have another lesson going on, oh my God, this is so bad. What time is it? 10:24. Okay, so we're going to talk now about the denoising autoencoder. This is the module I showed you before: I have my encoder, decoder, and the y-tilde. How does a denoising autoencoder work? I take my y, I corrupt my y, and I get this y-hat in red, and I want this y-hat to have a higher energy than this y. So a denoising autoencoder is basically a contrastive technique: I create my bad samples myself. I take my good sample, the blue one, and I add some noise, let's say some random Gaussian noise. This line is my training manifold; I have a few examples shown here with blue dots. Let's say I have my original y in the center; I kick it out with some corruption and I get my y-hat. How is the energy of y-hat supposed to compare with that of y? Larger, right? The hat means you push up the energy there. How do we do that? When we train this guy, we enforce that the output of the decoder, y-tilde, is close to my original y, regardless of the corruption. You see this, right? My y is my target; I corrupt my y and get this red y-hat, which goes through the encoder and decoder, and the output, the y-tilde, should still be made close to my original y. So I have my blue y, I displace this y over here, but then I force my model to reconstruct it at the original location. Again: I have my original y, I pull it here, and then I reconstruct it back there; I have my original y, I pull it here, and I force the model to reconstruct it back here. So I'm going to be learning a vector field that brings my displaced input back to its original location. We'll cover the notebook next time, because otherwise it won't be possible to finish this explanation.
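Here is a minimal sketch of one denoising-autoencoder training step, reusing the encoder, decoder, and optimiser from the earlier snippet. The Gaussian corruption and its scale sigma are illustrative assumptions of mine; the essential point is that the reconstruction target is the clean y, not the corrupted y-hat. The commented-out line sketches how a sparsity penalty on the hidden code (the other regularization just mentioned) could be added instead.

```python
import torch

# One denoising auto-encoder step (sigma is an illustrative choice):
# corrupt y, reconstruct from the corruption, pull ŷ back towards the clean y.
sigma = 0.3

def denoising_step(y, encoder, decoder, optimiser):
    y_hat = y + sigma * torch.randn_like(y)          # kick y off the training manifold
    y_tilde = decoder(encoder(y_hat))                # reconstruct from the corrupted input
    loss = ((y - y_tilde) ** 2).sum(dim=-1).mean()   # target is the *clean* y
    # loss = loss + 1e-3 * encoder(y).abs().mean()   # (sparse auto-encoder variant)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```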
So, we said we were here: we had the training manifold, and we had a few of these samples along it. I take my original y, I displace it, I get my y-hat, and then I enforce the denoising process, which means I enforce my y-tilde, the output of the system, to be attracted back to the y. Let me show you this one, oh yeah. We assume that we are injecting the same type of noise we are going to observe later on, when we see corrupted inputs. So in this case I show you these blue points; I take these original points, displace them, and get these red, perturbed ones, and then I enforce the system to go back to the original location. Now I take every possible point in this plane and enforce the network to put it back to its original location on this spiral. Here I'm just showing you the quadratic distance between the reconstruction and my original points. If the points come from this region, they didn't travel much after the reconstruction, therefore the energy term F is very low, close to zero; that's why it's purple. Instead, points that were far away, here in the corner, get dragged a lot, so the distance they travel is much larger; in this case it's one. And so here we've been learning an energy function by training a denoising autoencoder, which basically takes these displaced points and puts them back in place.

As you can notice, in the central region points were sometimes displaced towards this region and sometimes towards that region: this one was displaced up here and then put back down here, but this other point was also displaced up here and then put back down there. And so over there you see there is a flat region on top, such that there are no gradients going in any direction: you have a mountain, but the top of the mountain is flat. I fixed that by now enforcing the system to be attracted to the closest point on the manifold. So, before: I take my original point, I displace the point, and I force the system to go back; I take the original point, displace it, and force the system to go back. In this other case: I take my point, I displace the point, I check which of the points is the closest one, and then I go back down there; I take my point, displace it, check which point is closest to me, and boom, I go there. So in this case I try to fix that flat top, such that now it's like a crevice, I believe it's called in English: it has an edge rather than being flat. So it no longer goes back to the original location. Then I tried to train this with a sparse autoencoder, and I was not able to succeed; you can see here some purple regions across the manifold, and that was the best I could do, I couldn't get it to train properly. And I really have to stop here, because I have another class that started one minute ago. Thank you for being with me. I will not be able to answer the questions. The notebook, I guess, I will cover next time. Hopefully I also get the animations fixed, so that the colors are the correct ones: my points are the blue points, they are blue, and the displaced ones are red, they are hot, they do have high energy.
And then the output of the model is this kind of violet color. The slides will be uploaded when I finish the second course. Thank you for being with me, and thank you for being patient, especially today with this new kind of lesson. I hope you managed, that you were able to follow. If you weren't, check the previous videos on energy-based models, because, again, we are building on top of what we've been covering, and I felt there was a need for me to change the lesson based on your current knowledge. That was pretty much it. Again, peace, take care, bye. I'm switching to the other course; if you want to join, let me know. Bye, stop sharing.