Welcome to the third class of Deep Learning, 2022 Fall edition. So we've covered a few things so far: a bit of history in the first class, and then yesterday we covered a little bit of backpropagation and gradient descent. And I actually pointed out in the chat that gradient descent and backpropagation are not exclusively used for training, so we're going to be learning a bit more about that today. Today's topic is going to be only about inference. What is inference? Inference is about using neural networks: you're given a network, and we're going to learn today how we use it, how we make predictions. Before that, a few announcements. Why? Because I like to tell you about the things I like, and you might like them as well. Usually you do, I mean, most of the time. So one of the things I really like is this book by Edward Tufte. He actually sent me all his books. This one is called Beautiful Evidence, which is basically a book about quantitative visual art. This is the title; you can see it, right? There's some glare, OK. It's super nice, and it tells you basically how to properly convey information with a visual type of media. You should be able to convey information with different types of media; you don't have to provide just text. As you're going to see in my classes, I like a lot to write mathematics with colors, because I believe it makes it easier to make the connection between the charts, the diagrams, and the mathematics. Anyway, that's one thing I wanted to show you. Another thing: I told you last time to have a look at the introduction to linear algebra animated by Grant Sanderson, a.k.a. 3Blue1Brown, right?
I really encourage you to binge-watch his videos on YouTube so that you can start acquiring a visual understanding of how these mathematical operators behave, OK? So that you can reason in terms of pictures, in terms of images. That's how, at least, I find it very convenient: to think about things visually. On that topic, there are a couple of blog posts I really, really like from this guy, Gregory Gundersen. He has this amazing blog site: you click on Blog, and then at the bottom there is a "show all posts" link. He's a New Yorker, so you're also going to see things about New York. But the two things I wanted to show you here are, first, the matrices one. Where is it? "A Geometric Understanding of Matrices". This is, I think, really good, and it shows you basically how to think about matrices. So after you watch the videos from Grant, I would really recommend you read through this post, where you're going to get some additional intuitive understanding of how these operators work. And actually, we're going to be covering some of these things today as well. Another post, which is super nice, and which I also got inspired by, is the one on singular value decomposition. So, who doesn't know what singular value decomposition is? Tell me in the chat. Is there someone who doesn't know? I know: of course, you won't say "oh, I don't know". The point is that one thing is the mathematical introduction of the singular value decomposition, and if you don't know about that, I mean, you should, because you should have already taken linear algebra. But let's say we might have forgotten it, which is possible. Then again, if you forget something, it means you never understood it. If you understand something, it's part of you, so you cannot forget understanding. You can forget knowledge, but knowledge and understanding are different things.
So again, if you go on Twitter and type, as I told you last week, from:alfcnz and then "singular value decomposition", you find, oh, OK, actually this is the interpretation I will tell you about today. But if you scroll down, there's a preview from the book. OK, too much stuff. OK, here. So this is a lecture in the mathematical style of Gilbert Strang. So here you have singular value decomposition, and this one is called diagonalization of matrices. So what is another word for diagonalization of matrices? What kind of decomposition is this one? Type in the chat. Type, type. Eigenvalue decomposition, right? So do we know what eigenvalues and eigenvectors are? What would be a quick description of an eigenvector, in the chat? What is the physical intuition, the geometric intuition, behind eigenvectors? Do we know? Do we not know? "Direction of maximal variance"? Absolutely not, OK? So that's the main point I want to stress, right? So thank you for saying something that is completely wrong; it makes me understand that the things I'm trying to teach will, perhaps, give you something, right? So: vectors whose direction does not change under the application of a matrix. Yeah, that's correct, right? And I would actually say that there is no eigenvector; I would rather say eigendirection. So all the vectors that are on that line, that direction, will not change their direction after a linear transformation, OK? And so again, I wrote a bit about that. So if you go, in this case, on my blog, I mean, why not do an advertisement, right? So you go on the blog, and here I talk a little bit about these eigenthings, eigenstuff. It's the only article I've written. But here I basically introduce this "eigendirection" rather than "eigenvector", because I don't like the vector part. And then, basically, here you select two eigendirections.
And then, given that you've selected them, you can drag the eigenvalues, right? You can have 1, 2, −1, −2. So here you change the eigenvalues, and before, we'd just chosen the two eigendirections, which are these directions, these lines, right? And this is written in JavaScript; you can also check the code if you click on the link here, right? But I also explain to you how this works in the part below here, OK? So again, yes, it's a little bit of advertising of what I'm doing, but the point is that you should be getting a visual, or geometric, or whatever you want to call it, a physical understanding of all these mathematical things we are using. Otherwise, it's really hard to have an intuition about what the heck these things are, right? Especially since a lot of confusion arises from using terminology for which we don't quite have an inner understanding, right? Another question that I asked, I asked Yann as well, and he got it wrong, so I get some pleasure in that. My question is: do eigenvectors need to be orthogonal? Yes or no? Type in the chat. OK, so for those that typed yes, I'm asking. So, subsets, OK, fine, stop typing. Asking those who typed yes: are these two lines that are drawn on the screen orthogonal? Just those that typed yes before. Yes? Wait, what? You did see these lines, right? You see my screen; there are two white lines, right? Or gray lines, this one and this other one, right? Orthogonal means: are they at 90 degrees? Are they intersecting at 90 degrees? The answer is, I believe, obviously no, right? And so the answer to my question, "are eigenvectors orthogonal?", is no, of course not. I just showed you on the screen: these are not orthogonal eigendirections. So again, please, no shame in answering incorrectly, because, I mean, we are in class for learning. But the point is that if you find yourself doubting your understanding of something, just dig further.
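You can check this claim yourself in a couple of lines. Here is a small NumPy sketch, not from the course notebooks, with matrices of my own choosing: a non-symmetric matrix whose eigendirections are visibly not at 90 degrees, contrasted with a symmetric one, whose eigenvectors are always orthogonal.

```python
import numpy as np

# A non-symmetric matrix: its eigenvectors are generally NOT orthogonal.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)  # eigenvectors are the columns
v1, v2 = eigenvectors[:, 0], eigenvectors[:, 1]
print(abs(np.dot(v1, v2)))  # ≈ 0.71, non-zero → the eigendirections are not at 90°

# A symmetric matrix, by contrast, always has orthogonal eigenvectors.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])
_, Q = np.linalg.eig(S)
print(abs(np.dot(Q[:, 0], Q[:, 1])))  # ≈ 0 → orthogonal
```

So "eigenvectors are orthogonal" is a property of symmetric (more generally, normal) matrices, not of eigenvectors in general.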
So again, read through these little posts and you will see that it doesn't really take that much effort to understand these concepts. But again, you need to actually commit yourself to these things, right? One more thing, and then I stop with this advertising, otherwise we don't get anywhere today: it's going to be Mathematics for Machine Learning, OK? This book is by several people, but I only know Marc in person. The book is available for free online, right? So you just go to this link, and then you should really read the vector calculus chapter for next week, because you will have to do a homework, and this is really required, right? So I mean, in theory you should already have this knowledge from previous classes; still, a refresher wouldn't hurt. OK, just have a look at this one, right? And the two videos from Gilbert, right? Wouldn't hurt. The blog posts, I mean, these are maybe less necessary, but again, if you find yourself a bit rusty on these topics, have a look. I also really recommend checking out this one, right? The geometric understanding of matrices. And finally, the YouTube one, right? Where is it? Linear algebra, 3Blue1Brown, right? So there is a playlist, "Essence of Linear Algebra". I'd just watch all of it, right? I mean, it's a recommendation; you don't have to, but this is all I can do: I can try to help you catch up with all those things that might be a bit, you know, in the back of your mind, OK? So also, again, here you can search for posts from me, right, and then you want to look for "probability data science", OK? I think this is also a very good one: this book from Stanley Chan is really highly recommended. It explains probability for data scientists, so with code and in an interactive way.
Although we won't be using much probability in this course, in other courses you might need probability, and sometimes it's hard to get your head around these topics. So you may want to check out these resources, OK? And now I'm done with these resources, this advertising. I hope this is helping someone, at least, right? If it helps one person, I'm glad. But there were a few of you who got the eigenstuff wrong, so, you know, Yann included, you can gain something by checking out these resources. Anyway, moving on, let's close this thing. What are we talking about today? We said, what do we talk about today? Remember? No one says anything. Inference, yes, that's correct. All right, so this is called draw.io, or now they changed the name into app.diagrams.net, but you can just type draw.io, OK, press Enter, and you're going to get to the same place. What is this thing, right? So finally we see here, somehow, a model, right? We said that we use this kind of bullet-shaped figure for representing a deterministic transformation of something into something else, OK? So in this specific case, I have an x here, which is just denoting an arbitrary input; it's not necessarily a data point yet. So I have an input here that goes inside this thing, and then I get a specific output, OK? So in this case, I have this circle here, which is shaded, right? You can see the background is a different color from this one. So this means it's observed. So I provide an observation to my predictor, which is spitting out a y tilde, OK? So what equation would we write to represent this thing in mathematics? How would you type it in the chat? How would you describe this diagram in mathematical terms? So, yeah, there you are, right? Jack said what I wanted to point out: you have y tilde = f(x), right?
So the order of the items is all flipped, right? The y tilde, which is at the extreme right of the diagram, is going to be the first item on the left of the equation. Then you have, it's OK, you don't have to keep typing, then you have the predictor, which is still in the middle, right? But it's operating on the x, which in the equation is on the extreme right, right? And then finally, on the extreme left, you have the y tilde, right? So things are flipped, I believe, right? At least in my opinion. So this is going to be called the forward pass: you put something inside the network and it spits something else out. But again, the order of things is inverted, which is somehow counter-intuitive. Question: if we are trying to do, maybe, regression, we expect this y tilde to be how, with respect to my ground truth y? Far, close? Close, right? And, oh, equal, hopefully, yeah. So this tilde here on top, although the font is broken because of whatever issue the website has right now, the tilde means it's like an approximation of the y, OK? And somehow there has been a process, which is training, but we don't talk about training today, which has optimized, or has given us, the parameters, the weights, for this specific model, such that it converts the specific input x into some y tilde which is close to my target, OK? Anyway, how about I have something else, right? So in this case, over here, what's the major difference? Despite the font being broken, I have an x tilde here, and the circle is no longer shaded, right? You see? And the other thing is the blue y: now the y is blue and also shaded. So in this case, I'm given the y, but I'm not given the x. So how do I get the x out of a model, given that I only have the output? This is also called inference. So we can go in both directions.
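The forward pass above can be sketched in a few lines. This is a minimal NumPy toy, not the course notebook: the model, its shapes, and the random weights are all my own choices, just to show that y tilde = f(x) is literally "push x through the layers, read out y tilde".

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-layer predictor f; the forward pass computes y_tilde = f(x).
W1, b1 = rng.standard_normal((5, 3)), np.zeros(5)  # hidden layer, 3 → 5
W2, b2 = rng.standard_normal((2, 5)), np.zeros(2)  # output layer, 5 → 2

def f(x):
    h = np.tanh(W1 @ x + b1)  # linear layer followed by a non-linearity
    return W2 @ h + b2        # final linear read-out

x = rng.standard_normal(3)    # an observed input (the shaded circle)
y_tilde = f(x)                # the prediction: ỹ = f(x), read right to left
print(y_tilde.shape)          # (2,)
```

Note how the code reads left to right (input first, output last) like the diagram, while the equation ỹ = f(x) reads the other way around.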
This first one is called feed-forward, OK? In the feed-forward case, you feed something to the network, this model, this predictor, you feed something in and it spits something out, right? But here, in the other direction, I can't really reason in terms of mathematics; it's ugly and I can't really reason. In this case, I can reason: I feed something inside and it spits out some sort of prediction. How about this other case? I have something that I would like this value to be; how do I find this one, right? So someone says, yeah, invert the f, right? Invert the predictor. But let's say the predictor is not invertible, which in machine learning is actually very often the case. Most of the time, what we are actually doing is the following: this one is going to be our parameters, and given that we observe both the x and the y, we somehow find the parameters that allow us to connect the x to the y. Anyway, what we are going to be doing is the following. We're going to try to find: this x tilde is going to be the argmin over all possible x's. So I'm going to try a bunch of x's; my output, the y tilde, is going to be swinging around until I get the y tilde that minimizes this difference. So whenever I get the y tilde that is the closest, in terms of squared distance, to my target, then I've found the x which produces that specific y. Is there one single such x? No, no one said that. Is there an x which can produce exactly that y? Not necessarily either. I can only find the best x given the specific network configuration, right? It doesn't have to be exactly the same; you don't have to have exactly zero error. You can get the x that gives you the smallest error. So, "we infer the predictor from x and y": that would be training; we talk about that next week. This week, instead, we're talking about inference.
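This argmin-over-x story can be made concrete with gradient descent on the input. The sketch below is my own toy (a single fixed linear layer and a hand-picked target), not the class notebook: the weights stay frozen, and only x moves to shrink the squared distance between ỹ and y.

```python
import numpy as np

# A frozen predictor: a single linear layer f(x) = W x.  The parameters W
# are FIXED — this is inference, not training.  (Toy values of my choosing.)
W = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.5]])
y = np.array([1.0, -1.0])            # the given target (the blue y)

def cost(x):
    return np.sum((y - W @ x) ** 2)  # squared Euclidean distance ‖y − ỹ‖²

# Inference by gradient descent over the INPUT x (the weights never move).
x = np.zeros(3)
lr = 0.05
for _ in range(500):
    grad = -2 * W.T @ (y - W @ x)    # ∇ₓ ‖y − W x‖², via the chain rule
    x = x - lr * grad                # swing x until ỹ gets close to y

print(cost(x))                       # ≈ 0: this x is the minimizer we wanted
```

Here the system is underdetermined (3 inputs, 2 outputs), so there is a whole family of x's with zero cost; gradient descent just lands on one of them, which is exactly the "there is no one single x" point above.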
So there are two types of inference. There is the inference that you achieve the easy way, where you just perform a feed-forward pass, meaning you put a specific value at the input of the network and the network spits something out; we're going to be playing with this in a bit, with the code. The other way to do inference, which we won't be doing with the code today, is the other way around, right? Given that we have a predictor, a network, a model, whatever, and given that we know what our objective is, we can try to minimize the discrepancy between the output of the model and the target in order to get a possible x that produces that output. But what I just showed you right now, let me hide this bar on the left, can I? Yes, awesome. This is the wrong notation, OK? This is abuse of notation. Let me show you now the full notation, and therefore introduce, for the first time, the energy term, OK? So far, are there questions? I can also show you afterwards how to draw these diagrams, OK? There are no questions. You can type; I can wait 10, 20 seconds and see whether there are questions. So far it is clear, right, what I've been trying to explain? OK, all right. Moving on, my question to you: yesterday in the chat, and then I answered "lol", I put a statement, right? I said: backpropagation and gradient descent are not only used for training, OK? Why? Answer, type in the chat. They are also used for inference, right? This is very important. So again, what is backpropagation doing? What does backpropagation do? What is another name for backpropagation? Chain rule; it computes the gradients, OK. Gradient descent, what is it? I'm reading the chat, right? Update, not necessarily the weights, not the weights, right? That's too specific. Nor the parameters, right?
Because if it's operating on the parameters, it's already learning; it's already something more specific, right? So, finding something: finding some variables, I guess, by minimization of a cost, let's call it that way. And now we're sort of understanding why we talk about loss, cost, and now we'll be introducing this energy. Again, these are not necessarily synonyms, although they are all scalar functions, OK? Anyway, the loss: we don't talk about the loss this week, right? The loss we talk about next week. Everything we talk about today is going to be either costs or energies. What are these things? Let me show you. So, despite the font being broken, which is really upsetting me today, I'm sorry, I'm upset, it's really bugging me, but OK, I can't do anything because I'm using free online resources, and that's what you get. So this is the full notation, OK? For the moment, let's ignore this shaded box. So I have here an x with a check; we talk about the check afterwards. The x goes inside the predictor and the predictor gives me a y tilde. The y tilde is an approximation of, possibly, the y. Then on the other side I have my blue y, which is my target, OK? I'll tell you in the future why it's blue, but for now it's just blue, a color of my choice. This blue y and this violet y tilde both go inside the square, OK? The square represents a scalar function, meaning it gets inputs, maybe vectors; in this case it gets two scalars, so you can think about one vector of two elements, and it returns you a value. In this case, C stands for cost. The cost is the penalty you pay for making a misprediction, OK? So if y tilde is the same as y, you pay zero cost, right? You have a very small penalty, zero penalty, right? Otherwise, you're going to pay a penalty equal to the squared Euclidean distance from my target, the blue y, OK?
So here it's written underneath that my cost is going to be the cost between y and y tilde. And since I'm using this interface, I cannot have colored math, but later on we'll have colored math. This is simply equal to the squared Euclidean distance. And then finally, which is not such a big deal, but again it's a different thing, we define this capital F to be our energy, and our energy is a function of the inputs. What are the inputs of the system? Tell me. What goes inside? The big x and y, right? So x and y are the only inputs to the system. Later on we will also have z, so x, y and z, but today we just have x and y. So F is going to be a level of incompatibility between x and y, which is equal to, basically, a single scalar value, right? So again, no big deal: the C gives a scalar value, and this scalar value is the same value that this F has, right? It's a change of variables: C is a function of y, the target, and y tilde, the prediction, whereas F is a scalar function of the inputs of the system, right? And this y tilde is a function, right, we computed it before: y tilde is a function of the x, and so eventually the C is also a function of x, right? But the inputs of C are y and y tilde; those are its two arrows. You can also see the two arrows going into the dashed box, which are going to be the x check and the y. Finally, the x check: the check looks like an arrow pointing downward, right? So that's like the minimizer, a minimizer of the energy. This capital X should be a calligraphic X; I'm sorry, the font got broken today. This should be a calligraphic X, telling you this F is a function over this big curly X and a given observed y, right? So in this case, the y is observed, so it's shaded. So I have this value, right? It's almost like a conditional type of energy: I have this y, and then this function changes as I change the values in this position here.
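The change of variables between the cost C and the energy F can be written out directly. This is a toy NumPy illustration of my own (the model W and the example points are invented), just to show that F is the same scalar as C, merely viewed as a function of the system's inputs x and y.

```python
import numpy as np

# The cost C compares target and prediction; the energy F is the SAME
# scalar, seen as a function of the system's inputs x and y.
W = np.array([[1.0, -0.5],
              [0.5,  1.0]])

def f(x):                      # the (frozen) predictor: ỹ = f(x)
    return W @ x

def C(y, y_tilde):             # cost: squared Euclidean distance
    return np.sum((y - y_tilde) ** 2)

def F(x, y):                   # energy: level of incompatibility of (x, y)
    return C(y, f(x))          # just a change of variables, ỹ = f(x)

x = np.array([1.0, 0.0])
y_good = np.array([1.0, 0.5])  # compatible with x, since f(x) = (1.0, 0.5)
y_bad = np.array([0.0, 0.0])   # incompatible with x

print(F(x, y_good))            # 0.0   → perfectly compatible pair
print(F(x, y_bad))             # 1.25  → higher energy, less compatible
```

Low energy means x and y "go together" under this model; high energy means they don't.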
So x check is going to be the value that minimizes the, how do we call this F? What is it? What is F? Type the definition of F, the energy, in the chat. What does it represent? What does it mean? The energy represents the level of, let's type it down in the chat, level of incompatibility between x and y, OK? Very good. These are definitions we have to keep in mind. Very good. OK, so in the next part of the lesson, we are going to be going through the notebooks. So you had two options, but I already decided which one to do; I'll show you where you can find notebooks that are similar to the one I'm covering today, but I was thinking to show you the notebook I actually work on for my book, right? So I'm writing a textbook about this class, basically. I'm very late; I wanted to actually get a major part of it done before starting this semester. It didn't happen. So I will show you the notebook I made for the book, which is very similar to the one you can find online, right? So, where to find the resources online? You go on GitHub, right? Then there is the 2021 edition, but here I only have a few notebooks, which we are going to be covering in the future, and most of the notebooks are from the 2020 edition. I haven't changed them too much; well, I changed them for the book, right? But the repository for the book is not yet available online because it needs to be fixed, right? I will answer the question in a second. So the one we are going to go over right now is going to be number two, space stretching, or some variation of this, right? Before I answer the question: is there anyone in class not familiar with NumPy? I repeat, is there anyone in class, out of 90 people, not familiar with NumPy, or num-pee, or however you want to call it in French or Spanish? NumPy? I'm like, yeah, no, I'm just kidding. Sorry.
Okay, so it seems like either people didn't hear the question, or didn't know, or, at least, maybe everyone knows about this, right? Anyway, I'm leaving this tensor tutorial for you to do by yourselves, right? So just go over this first notebook on your own; it explains a little bit how to use Jupyter notebooks, how to pull up the documentation, and so on, OK? So this is a very basic tutorial. I don't want to go over it in class because it would mostly be a waste of time for most of you, since, again, it seems that you already have some exposure to NumPy, and PyTorch has a very similar interface, by design. PyTorch is not based on NumPy, actually, right? Someone posted that on Twitter a few weeks ago; it's incorrect. PyTorch derives from Torch, which we developed initially using the Lua scripting language, just because it was super light and there was no big Python community at the time, so Python didn't make sense then. Then my student decided to port, basically, Torch to Python, and came up with PyTorch during a few months' internship at Facebook, and now the whole world is using this framework, and I feel so proud about that, right? Anyway, that's mostly the reason why we are using it, but also because it's state of the art, right? Anyway, today we're going to be covering this one here, space stretching, OK? So, how to get everything running? Here underneath you have instructions about how to get your Miniconda environment up and working, in 15 different languages. If you find bugs and typos in any of the translations, in any language you speak, please send a pull request; this is really highly appreciated, right? I don't speak all these languages, so these have all been contributions from people that gave some time to this course. Anyway, if you install all these things, you're going to get the environment set up.
There's going to be the pDL (PyTorch Deep Learning) environment, and then you can run either JupyterLab or the classic notebook; the lab is much nicer, I think, but these notebooks have been developed with the Jupyter notebook, so some things might not quite work out of the box in JupyterLab. Anyway, I was going to tell you that instead of going through this notebook here, you can check this out, right? So, my expectation, and also someone asked me today, "do we follow along with a notebook in class?": no, well, you follow what I'm doing in class, right? And then afterwards, you're kind of supposed to spend at least one hour per notebook on your own time, right? Such that you get familiar with it: you check the documentation, you hit some bugs, you check the dependencies, I don't know; you should get familiar with the code such that when you actually have to do the homework, you already know how most of the things work. So instead of doing that, we'll just use the book one. Let me see, there was a question: what is the difference between the check sign and the hat over x? OK, awesome, that's a good question. So we don't have a hat over anything yet, right? So far today, we've only seen the tilde, which means an approximation, because it's like when you write pi ≈ 3.14, right? Pi is not 3.14, but it can be approximated as 3.14; that tilde means it's an approximation, not exactly the value you get, right? So y tilde tries to be similar to y. How do we do that? By plugging in here, basically, a spring. What is a spring, right? Have you taken physics? Yes? What is the force of a spring? F = −kx, very good, right? And this is called Hooke's law. What is the energy of a spring? Oh, also, let's call x₀ the equilibrium point, the relaxation point of the spring. One half, yeah, let's use the relaxation point at x₀, right?
Right, otherwise you'd have the spring that flips at zero, right? Or let's have a reference system where you attach the spring at zero and it is at equilibrium at x₀, right? There we go. So you have ½k(x − x₀)², right? And that looks very similar to this thing, right? Well, apart from the one half and the k, right? But again, those are just constants, right? Whatever, we don't care. So if you look at this thing, it basically represents the energy contained in a spring which has been displaced by y tilde − y, OK? You see that? And so, when we train the system later on, we'll basically tune the parameters such that all the springs get relaxed. Or, in the other case, when we try to infer the x check, right, how do we find this x check? You have a specific y, you connect the spring between the y and the y tilde, and then you start moving the x check such that you try to relax the spring, OK? I hope it makes sense. If it doesn't, just think more about what I said. If you don't remember what I said, check the recording. If it still doesn't make sense, that's OK: you have 14 more lessons to get acquainted with all these things. Don't worry. Also remember, I'm sarcastic a lot of the time. "In the last class you said you would be telling us the difference between y tilde and y hat." OK, we haven't seen y hat yet, and if you have seen any y hat so far, it's just a typo, OK? I know I don't have the typo, but Yann may have it in his slides, because I've been trying to update them. We only have y tilde for the prediction and y check for the output of a minimization, OK? So the check points downwards, which means it's a low-energy value, OK? And then you're going to see, later on, we're going to have a hat. Possibly next week, or the week afterwards, I will show you an example of an x hat or a y hat or whatever hat. That is called a contrastive sample, OK?
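The spring analogy above can be written out in one chain of formulas, using the same symbols as in the lecture (x₀ is the relaxation point; the ½ and k are the constants we said we don't care about):

```latex
% Hooke's law: restoring force of a spring displaced from its rest point x_0
F_{\text{spring}} = -k\,(x - x_0)
% Stored (potential) energy: integrate the force over the displacement
E_{\text{spring}} = \tfrac{1}{2}\,k\,(x - x_0)^2
% Compare with the quadratic cost attached between prediction and target:
C(y, \tilde{y}) = \lVert \tilde{y} - y \rVert^2
% Up to the constants 1/2 and k, the cost is the energy of a spring
% displaced by \tilde{y} - y: training (or inference) relaxes the spring.
```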
It's supposed to be a sample to which we want to assign a high energy. Again, this doesn't make much sense right now, because we haven't really talked about high energies; we've only talked about low energy, right? So there is this paper from a colleague of mine, Lerrel Pinto, who collected several examples of grasping objects with a robot. So you have the picture of the object and the orientation of the gripper, I think it's called in English, and then you have whether the object was successfully grasped or unsuccessfully grasped, OK? So you want to learn what the correct orientation is. You could do that by using only a small fraction, let's say 1%, of your data: you throw away all the unsuccessful cases, you keep just the successful cases, and you train a regressor over the correct orientation of your gripper given the object, OK? That would be horrible, you know, because you throw away all those data points, right? The alternative is to use classification. Let's say you discretize all possible orientations into, I don't know, 36 possible angles, so 10 degrees each. And then you use the blue y to try to increase the probability that the model will select that orientation. Or you consider a y hat, which is a negative, a contrastive sample, in order to decrease the probability of that specific orientation, OK? And so, in those cases, when you run backpropagation, instead of running gradient descent you actually have to run gradient ascent, OK? So you run gradient ascent on the contrastive sample, when you have a contrastive sample, or just put a minus, right? So if you want to run gradient ascent for the contrastive sample, you can simply take the loss and put a minus in front: you're going to get the flipped gradient, right?
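The descent-versus-ascent trick can be shown with a toy scalar energy of my own invention (the function, the samples, and the learning rate below are all made up for illustration, not from the grasping paper): one gradient step down the energy for a positive sample, one gradient step up (equivalently, descent on the negated loss) for a contrastive sample.

```python
import numpy as np

# Toy sketch: push DOWN the energy of a positive sample and UP the energy
# of a contrastive sample, simply by flipping the sign of the gradient step.
w = np.array([0.5, -0.2])

def energy(w, x):
    return np.sum((w * x) ** 2)   # an invented scalar energy, for illustration

def grad(w, x):
    return 2 * w * x**2           # ∇_w of the energy above

x_pos = np.array([1.0, 1.0])      # positive sample (e.g. successful grasp)
x_neg = np.array([1.0, -2.0])     # contrastive sample (e.g. failed grasp)
lr = 0.01

w_down = w - lr * grad(w, x_pos)  # gradient DEscent: energy goes down
w_up   = w + lr * grad(w, x_neg)  # gradient Ascent = descent on −energy

print(energy(w_down, x_pos) < energy(w, x_pos))  # True
print(energy(w_up, x_neg) > energy(w, x_neg))    # True
```

Backpropagation is the same in both cases; only the sign in front of the gradient (or in front of the loss) changes.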
So the point is that, in that case, once you collect correct and incorrect cases for classification, you want to tell the model that that specific image should not have that orientation associated with it, because it's not going to work, OK? So that was a good question, right? The y hat, in this case, is used for classification as a kind of proxy for regression, which otherwise wouldn't use all the data you collected. This is super cool, I think; you can check out the paper. Anyway, let's move on, right? Otherwise we don't get anywhere. So let me show you a little bit of the actual book, and then a bit of the notebook, OK? So here I'm showing you that a neural net is basically a stack of linear and non-linear layers, OK? So if you think about a neural net, I'd like you to think about a sandwich, OK? What happens if you have a sandwich made only of slices of bread? What do you call a sandwich made only of slices of bread? It's a loaf, right? A loaf of bread. So in order not to have a loaf of bread, you need to put what? Something in between the layers, right? Similarly, a neural net made only of linear layers is going to be simply one big linear layer. Instead, we're going to need a sandwich alternating linear and non-linear layers. We saw that also in the slides with Yann yesterday, with that kind of chain of multiple things. So again, we're going to be using this kind of notation, which helps us understand the flow of information through the network. I will call my generic input r, and then the output is going to be s. Where can you find this textbook? That's a very good question: on my website, when I finish writing the first draft. So far you just get a preview. It hasn't been peer-reviewed, so it's not yet available for others to read. First, Yann has to fix my errors, and then it's going to be available for everyone to read. Anyway, the things I'm telling you, at least I know they are correct.
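The "loaf of bread" claim is easy to verify numerically. Here is a small NumPy sketch (random matrices of my own choosing): two stacked linear layers collapse into one linear layer, but inserting a non-linearity in between breaks the collapse.

```python
import numpy as np

rng = np.random.default_rng(1)

# A "sandwich" of bread only: two linear layers with no non-linearity in
# between collapse into a single linear layer (one loaf of bread).
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

two_linear = W2 @ (W1 @ x)      # linear ∘ linear ...
one_linear = (W2 @ W1) @ x      # ... is just one bigger linear layer
print(np.allclose(two_linear, one_linear))    # True

# Put in the "filling" (a non-linearity) and the collapse no longer happens.
with_filling = W2 @ np.tanh(W1 @ x)
print(np.allclose(with_filling, one_linear))  # False
```

This is why every expressive network alternates linear layers with non-linearities.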
You're gonna have this linear layer, and how do we think about linear layers? So again, talking about this SVD, any linear layer can be decomposed into three items: a rotation, a scaling, and another rotation. Moreover, whenever you are in a high-dimensional space, we just reason about this linear layer basically as one big rotation, and we discard the other two. Why do we usually just think about rotation? Let's say you are in a hundred-dimensional space: in how many ways can you change the orientation of a vector, say moving it around a hypersphere? If you have two dimensions, in how many directions can you rotate the thing? It's gonna be one: you can go around a circle, right? So you have one degree of freedom to change the orientation, and then one degree of freedom to change the length. If you are in a three-dimensional space, you're gonna have a surface, so you have two degrees of freedom for changing the orientation, and then one degree of freedom for changing the length, right? If you are in a hundred-dimensional space, you're gonna have 99 degrees of freedom for changing the orientation, and still just one degree of freedom for changing the length. And so, when we want to compare vectors in a high-dimensional space, we care about the alignment of vectors more than the length of vectors, right? Because the alignment can be much more descriptive: it can contain much more information, it has many more ways of expressing differences, of encoding a representation, whereas the length is gonna be just one degree of freedom.
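A quick sketch of that alignment-versus-length point (the dimension 100 and the vectors are arbitrary choices for the demo): cosine similarity ignores length entirely, and two random high-dimensional vectors are nearly orthogonal, so alignment is a very discriminative signal.

```python
import torch

torch.manual_seed(0)
d = 100
u = torch.randn(d)
v = 3 * u            # same direction, three times the length
w = torch.randn(d)   # an independent random direction

cos = torch.nn.functional.cosine_similarity
same_dir = cos(u, v, dim=0)   # exactly 1: alignment is blind to length
rand_dir = cos(u, w, dim=0)   # close to 0: random directions in 100-D barely overlap
```

The 99 orientation degrees of freedom are what make `rand_dir` hover near zero: there is simply a lot of room on the hypersphere.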
So again, although a linear transformation can be split into a rotation, a scaling, and a rotation, usually when we reason about a linear transformation in a high-dimensional space, we just reason in terms of rotations, okay? So most of the time when you hear me talking about a rotation, it means a linear transformation in a high-dimensional space. Anyway, one more thing is the fact that whenever you write the math in this way right here, it's annoying again because things are flipped, right? So R is the input; the first operator applied is this V transpose, which is a rotation; then you have this Sigma, which holds the scaling factors; then you have this U, another rotation; and then you get the output, right? So everything is flipped, whereas here you can clearly see which operation happens first and which happens later. Anyway, a little bit of math, we don't care; this is the big picture. Whenever you have a matrix, it's just a bunch of numbers, and they don't really tell you much. Instead, if you perform an SVD, it's like making a radiography of your transformation, such that you can actually clearly tell what's happening. Any matrix, any linear transformation, rotates a pair of orthogonal axes in the input space into a different orientation here in the output space, and then changes each magnitude independently. So you have one big rotation from this side here to the right side over here, and then each axis has its own length change. What happens is that you have a circle that gets rotated and then squashed in one specific direction. Moreover, the first singular value, since they are actually ordered, is gonna tell you the maximum deformation, the maximum length of a transformed vector. So let's say you have your input vectors and sigma one equals three: you know that this matrix multiplication can provide you up to a zooming factor of three, okay? It can scale things up by at most three.
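Both claims can be verified directly with `torch.linalg.svd` (the random 2×2 matrix here is just a stand-in): the matrix factors into rotation–scaling–rotation, and no unit vector gets stretched beyond the first singular value.

```python
import torch

torch.manual_seed(0)
A = torch.randn(2, 2)
U, S, Vt = torch.linalg.svd(A)

# The "radiography": A = U @ diag(S) @ Vt, i.e. rotate (Vt), scale (S), rotate (U).
reconstructed = U @ torch.diag(S) @ Vt

# The first singular value bounds the stretch of any unit vector.
x = torch.randn(1000, 2)
x = x / x.norm(dim=1, keepdim=True)   # points on the unit circle
stretch = (x @ A.T).norm(dim=1)       # lengths after the transformation
```

Every `stretch` value sits at or below `S[0]`; directions killed by a small second singular value end up near `S[1]`.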
Why do I say up to? Because let's say sigma two is zero: it might completely kill all the vectors that go in that direction, for example. Anyway, this is too mathematical, too boring, so instead what we're gonna do now is have a little bit of fun with the notebook, right? I'm gonna be showing you exactly this. I will have a bunch of points and a random two-by-two matrix. What is a matrix doing? A matrix is a linear transformation, and we said we can think about it as performing what operation? Type, type, type, type. Quick, quick, quick, quick. A rotation, some stretching, and then some other rotation, okay? Again, in 2D we can actually think about combining these into shearing and other things, but it becomes a little bit too messy; we don't care about that in high-dimensional space. So we have some scaling, some rotation. We're gonna get basically this circumference rotated. You can see these two axes telling you the orientation of this circumference, and you can see these green points in correspondence of V1: they get rotated down here, right, onto U1, and then they get scaled, in this case by a little bit less than one, which I'll show now with the notebook. So let's have a look at the notebook, and we can learn how to use PyTorch. Again, my recommendation is, after we are done with this lesson in 10 minutes, you pull out the Jupyter notebook, either the one on GitHub or you just type it yourself, and get familiar with all the different functions we talked about today, okay? In order to get your hands dirty, right? Hands-on experience playing with mathematical tools. This is a numerical-calculus kind of discipline; it's an empirical discipline.
You need to actually try things out, not only to make sure your mathematics is correct, but also to acquire an intuition, right, about how these things work. Without intuition, it's really hard to make educated guesses about how something could work. If you just think about coding up everything that might come to your mind, you will never have enough time to try everything. But instead, if you use your mind as a predictive model, you can come up with good possible ways of getting a good reward, a low cost, in the future. But in order to train a predictive model, what do you need to do? How do we train predictive models? Do we know? I mean, we haven't yet talked about predictive models, but do you know the answer to this question? Yes, with a lot of data, right? You have to play with observations and get your mind to be able to predict the future. Anyway, so usually you're gonna go on your GitHub, right? GitHub, and then you're gonna go on PyTorch Deep Learning, the repository we talked about before, but we're gonna be going now inside Book and Python, which is not yet available to you, I know that. You're gonna be doing conda activate pDL, for PyTorch Deep Learning, but I will be doing activate book, right? Always use environments. And then I also have an alias JL for JupyterLab, such that it's quick to launch. Let me know if anything is not clear in the code, right? Because sometimes I go too fast in the code and I don't realise that; I try to go at a decent speed. So here, basically, all we care about is the fact that I import torch, and from torch I import optim and nn, which are going to be useful later on. Then I import other utility functions I define, which are gonna be slightly different in the version on GitHub. Doesn't really matter.
These are just the plotting routines. Here I create a folder for saving the results, just for generating the book. And here what I'm doing is sampling P, capital P; we asked yesterday what it was. P is the total number of samples, so it's gonna be 100 samples of two dimensions from a random normal distribution, right? Then I have my labels: zero everywhere, one if you are in the second quadrant, two if you're in the third quadrant, and three if you're in the fourth quadrant. Why is that? Such that you can tell the orientation of things and whether there is a flipping. When do we have a flipping once we apply a linear transformation? Do you know what I'm talking about? When do we have a reflection? When the determinant is negative, yes. Okay, so here I plot the input space, which we call R, right? So we have R1 and R2 with four different colours, okay? And then here I use torch.manual_seed in order to always get the same random numbers whenever I create this linear layer, such that things are consistent, especially for writing the textbook. And in this case, it's gonna be a two-by-two linear transformation without bias. What does that mean? If there is a bias, there is going to be an additional shift after applying the rotation. A rotation, well, a linear transformation, plus a bias is also called an affine transformation. Again, this gives the model more degrees of freedom to play with, okay? But here, no bias, that's it. So I just generate this linear two-by-two, which is basically a matrix, okay? What I'm doing here below is the following. I say: torch, please, please do not track the computational graph. What does this mean?
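A minimal sketch of that setup (the exact labelling in the notebook may differ; this is one reasonable way to assign quadrant labels): sample the points, tag each quadrant, and build the bias-free linear layer.

```python
import torch

torch.manual_seed(0)
P = 100                      # total number of samples
r = torch.randn(P, 2)        # P points in 2-D from a standard normal

# Label each point by quadrant (0, 1, 2, 3) so that orientation changes
# and reflections stay visible after a transformation.
y = torch.zeros(P, dtype=torch.long)
y[(r[:, 0] < 0) & (r[:, 1] >= 0)] = 1    # second quadrant
y[(r[:, 0] < 0) & (r[:, 1] < 0)] = 2     # third quadrant
y[(r[:, 0] >= 0) & (r[:, 1] < 0)] = 3    # fourth quadrant

# A 2x2 linear transformation without bias: rotation + scaling, no shift.
model = torch.nn.Linear(2, 2, bias=False)
```

With `bias=False` the origin stays fixed, which is exactly why the transformed cloud stays centred.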
In PyTorch, every time you use a model that has weights inside, that has parameters, PyTorch keeps track of the computational graph, such that once you'd like to perform backward, it knows which Jacobians it needs to multiply the output gradient with, okay? We saw yesterday that in the forward pass, basically you put some input, the X input, whatever, inside the model, and you get something at the output. Eventually you're gonna get a final number, and its derivative, one, is the output gradient you send backward. And how does it get affected by the modules the signal went through forward? Well, you have to multiply the output gradient by the Jacobian of each individual module, given the specific input we fed it before, right? So most of the time we also need to keep the previous input nearby, because it's most often used in the backward pass. So the output gradient gets multiplied by the Jacobian and keeps getting propagated backwards through the model, okay? Again, PyTorch therefore creates a computational graph with the sequence of operations used in the forward pass. Since today we're talking only about feedforward, we don't care about going backwards, so we're gonna be using this `with`, what's it called in Python, I forgot, help me out, context manager, is that correct? Context manager, yeah. So this is a context manager which disables PyTorch's tracking of the sequence of operations, okay? And so basically, by running this, I send R, and what was R? R was the bunch of points, right, the 100 points in two-dimensional space. So I send the whole R inside the model, and this guy spits out S, right? Which is going to be what? How many points? Are you following? How many points are in S? 100, yes. So it's gonna be the same 100 points, but rotated, basically, right? Rotated, because there is no bias, right?
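Concretely, that inference-only forward pass looks like this: inside `torch.no_grad()` no graph is recorded, so the output carries nothing to backpropagate through.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(2, 2, bias=False)
r = torch.randn(100, 2)          # the 100 points in 2-D

# Inference only: the context manager disables graph tracking,
# so no Jacobian bookkeeping is stored for a backward pass.
with torch.no_grad():
    s = model(r)                 # same 100 points, rotated (and scaled)
```

Outside the context manager, `model(r)` would instead return a tensor with `requires_grad=True`, dragging the whole graph along.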
There is no shifting. Here I compute these U, Sigma, and V transpose with torch.linalg.svd, to which I input the weights, which is the matrix, and I use this .detach(), right? This is also important. When I do .detach(), basically I disconnect any computational graph that was made before. So this .detach() is giving me the actual matrix without, okay, how does it work? Let me explain again. PyTorch creates a computational graph every time you use something that requires a gradient, okay? And the weights are parameters, in PyTorch jargon, which automatically require gradient. Now, if I call .detach(), I prevent a computational graph from being created, right? One way to do that was the way before: there I said, do not create a computational graph, and just proceed forward with this thing. And there I didn't have access to the weights, right? That is a module, an object in Python: I'm sending something inside the object and something comes out, so I cannot cut the graph or whatever, right? Or I could have said, at the end, .detach(): I could have created something which has a computational graph and then detached it there, right? But then that one does create a computational graph first. Instead, here below it's a different way of performing something similar: I take the weight, which is a parameter, but then I say, oh, it's no longer a parameter, this is just the values, the tensor, right? And then I compute the singular value decomposition of the values, not the parameters, okay? Let me see the question here. Does it initialise the weights to something random by default? Yes. So, for this nn.Linear here, you can check the initialisation in the PyTorch documentation, okay? PyTorch initialisation, okay?
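A small sketch of that step: detach the weight from autograd first, then decompose it. Note that `.detach()` is not a copy, it is the same memory viewed without gradient tracking, which is exactly the point raised next in the Q&A.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(2, 2, bias=False)

# .detach() hands back the same values, cut out of any autograd graph,
# so the SVD below is a plain numerical computation on the weight matrix.
W = model.weight.detach()
U, S, Vt = torch.linalg.svd(W)   # rotation, scaling factors, rotation
```

`W` shares storage with `model.weight`; only the autograd link is gone.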
So you can check later in the manual how things are initialised, okay? I would really, really recommend you go through the whole documentation of PyTorch, to be sure you know what's going on there, okay? So here I compute this singular value decomposition to find the orientation of things: the initial orientation of that pair of axes in the input space, and the orientation in the output space. Both of them are orthogonal, right? These are called singular vectors, left singular vectors and right singular vectors. The property of singular vectors is that they are orthogonal; it's these that are orthogonal, not eigenvectors in general, okay? Anyway, so we have a bunch of axes in V transpose in the input space, a bunch of axes in U in the output space, and they have this scaling matrix, capital Sigma, okay? Here I transpose such that we have column vectors, and then with this U Sigma I just scale those vectors by the scaling factors. So here, what do I do? Here I create a circle, such that I can plot things, okay? Is the detach like a deep copy? Absolutely not. There is no copy going on. I'm just using it as input, and nothing will change in this thing; it's just an input to the system, not a copy at all. If you want to do a deep copy of the tensor, you want to do clone, right? That would be a new instance in memory, okay? But we don't need that here. Okay, so here I create a circle, right, with cosine and sine; tau is two pi, right? Tau is just nicer. Anyway, here's the interesting thing. What is this circle? How many points are inside the circle? Do you see? 100. So circle is a bunch of 100 points in two-dimensional space that are equally spaced along this unit circumference. Then I put this inside this linear transformation and I get an ellipse. And then here I show you the input space and the output space, okay? This is going to be the input space with V1 and V2.
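A minimal sketch of that circle construction, plus the clone-versus-detach distinction from the Q&A (exact variable names in the notebook may differ):

```python
import math
import torch

tau = math.tau                        # tau = 2 * pi, just nicer
t = torch.linspace(0, tau, 101)[:-1]  # 100 equally spaced angles (drop the duplicate endpoint)
circle = torch.stack((t.cos(), t.sin()), dim=1)   # 100 points on the unit circumference

# clone() is the actual deep copy: a new tensor in new memory.
copy = circle.clone()
```

Feeding `circle` through the linear layer would trace out the ellipse: each `v` axis lands on a `u` axis, scaled by its singular value.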
And then it's going to be the output space, which has U1 and U2 in a different orientation and shrunk, right? So you can tell that the largest possible stretching that this matrix applies is going to be something like 0.9 or so, okay? Finally, quickly, quickly, such that I don't take too much of your time. We said these linear transformations are basically rotations, but we also have something else, and I will take those questions later after the class, right? These are going to be called activation functions, which act, most of the time, on each single item individually, okay? In this case, I have the positive part, or ReLU, and we had the leaky ReLU, which has a compression factor that doesn't compress all the way down to zero, right? It has a compression factor of, in this case, one twentieth. Other options are going to be the hyperbolic tangent and the sigmoid. These are the transfer functions: they tell you how the output, the activation A, changes given that you change S, the summation, the linear output you get from the previous part. And so these are going to be the four options you may get, right? If this is my input on the top left, on the top right you're going to have this leaky ReLU. If you just have the ReLU, you completely kill the second, third, and fourth quadrants. With the leaky ReLU, you just compress the negative side of each coordinate by a factor of 20. If you have the hyperbolic tangent, you basically get a square out of the circle. And with the last one, you're going to get some smooth kind of square centred at (0.5, 0.5), okay? How did I do that? Simply, I initialise here: I just used torch's hyperbolic tangent, logistic sigmoid, or this leaky ReLU, and then I do exactly the same as I showed you before.
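The four pictures correspond to the four element-wise maps below; a minimal sketch of what each one does to a cloud of points:

```python
import torch
from torch import nn

torch.manual_seed(0)
r = torch.randn(100, 2)

relu = nn.ReLU()               # positive part
leaky = nn.LeakyReLU(1 / 20)   # negative side compressed 20x, not killed
tanh = nn.Tanh()
sigmoid = nn.Sigmoid()

a = relu(r)      # negative coordinates zeroed: everything lands in the first quadrant
b = leaky(r)     # negative coordinates survive, divided by 20
c = tanh(r)      # the plane is squashed into the open square (-1, 1)^2
d = sigmoid(r)   # squashed into (0, 1)^2, centred at (0.5, 0.5)
```

Each function acts on every coordinate independently, which is why they distort the cloud axis by axis rather than rotating it.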
Let me show you once here. I just send my points, the 100 points, through this hyperbolic tangent, the sigmoid, or the leaky ReLU, okay? Finally, once you put this together, you're gonna get the fully connected architecture, which is the most generic architecture we encounter. It has no assumptions about the input. It has the first part, which is going to be the linear part, right, this kind of rotation, and then each branch, each dimension, you can see here, gets F, this non-linearity, which spits out this activation value, okay? The activations are the values of the non-linear function, okay? Whereas this S is just the linear summation of the initial components, okay? And this is called FC, for fully connected model. Again, we use this kind of bullet shape to represent all these operations here. And what happens if you do that? You're gonna get this thing here. I show you three fully connected layers: one 2 to 20, one 20 to 20, and then one 20 to 2. So I have two hidden layers, right? The output is gonna be S, in green, because this is gonna be a linear output: I don't have a non-linearity on top, I just have a non-linearity here and a non-linearity here. So these are again my input points, sampled from that normal distribution, with a few more level lines such that we can see what happens. And here on the bottom part, you can see the network with the hyperbolic tangent, which gives a kind of smooth transformation, but it's not the best for learning arbitrary transformations. And on the right-hand side, you can see this network with the positive part, with the ReLU, okay? So this is pretty much all I wanted to show, and I really went over time. So, I would say, just go over these two notebooks yourself after class, okay? The first two: the tensor tutorial and the space stretching. And the lesson is dismissed. Sorry for going a little bit over, okay? More questions in the chat?
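The three-layer fully connected network just described can be sketched like this (swap `nn.Tanh()` for `nn.ReLU()` to get the other panel of the figure):

```python
import torch
from torch import nn

torch.manual_seed(0)

# Three fully connected layers, 2 -> 20 -> 20 -> 2: two hidden layers
# with a non-linearity each, and a linear output S (no non-linearity on top).
model = nn.Sequential(
    nn.Linear(2, 20), nn.Tanh(),
    nn.Linear(20, 20), nn.Tanh(),
    nn.Linear(20, 2),
)

r = torch.randn(100, 2)   # input points from the normal distribution
with torch.no_grad():
    s = model(r)          # the warped point cloud shown in the figure
```

Each `Linear` is the "rotation" half of the sandwich, each `Tanh` the filling; the untrained network already warps the plane in a non-linear way.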
Yes, thanks, bam, Wikipedia, done, finished, right? I think I answered everyone. You're very active, I like it. You're more active than last semester; you're a good batch. I hope it's gonna stay this way, right? Especially later on, when the concepts get a little bit harder to understand. If you don't ask questions, it's hard for us to address your doubts, because again, we can't really read your minds. A bit we can, right? We've been doing this for a while, it's been 10 iterations now, but still, you should speak up, speak your mind, and there's no shame in asking any kind of question related to the class, right? Okay, enough. Bye-bye, see you next week. Enjoy the weekend, I'm going dancing. If you want to dance, there's a very nice studio called Empire Mambo. We're starting a new cycle in two weeks. Again, advertisement, I don't get paid, but it's so good; it's like the number-one school in the world for dancing, and we have it here in New York City. Anyway, bye-bye, see you next week.