OK. So first of all, I have a terrible confession to make. This class is actually being run not by me, but by these two guys, Alfredo Canziani and Mark Goldstein, whose names are up here. They are the TAs, and you'll talk to them much more often than you'll talk to me. That's the first thing. The other confession I have to make is that if you have questions about this class, don't ask them at the end of today's lecture, because I have to run right after class to catch an airplane. But it can wait until next week. OK. So let's dive right in. Some very basic course information. There is a website, as you can see. I will do what I can to post the PDF of the slides on the website before the lecture — probably just a few minutes before the lecture, usually — but it should be there by the time you get to class. There are going to be nine lectures that I'm going to teach on Monday evenings. There is also a practical session every Tuesday night that Alfredo and Mark will be running. They'll go through practical questions, refreshers on the mathematics that's necessary for this, basic concepts, and tutorials on how to use PyTorch and various other tools. And there are going to be three guest lectures. The names of the guest lecturers are not finalized, but they'll be on topics like natural language processing, computer vision, self-supervised learning, things like that. There's going to be a mid-term exam, or at least we think there is, and it's going to take one of those sessions around March. The evaluation will be done on the mid-term and on the final project. And you can team up in groups of two — did we say two or three, or just two? We didn't decide yet. We'll see. The project will probably have to do with a combination of self-supervised learning and autonomous driving. We are talking with various people about data and things like that. Okay, let me talk a little bit about — so this first lecture is really going to be a broad introduction to what deep learning is, what it can do, and what it cannot do. It will serve as an introduction to the entire thing. We'll go through the entire arc, if you want, of the class, but in very superficial terms, so that you get a broad, high-level idea of all the topics we'll be talking about. And whenever I talk about a particular topic later, you will see where it fits in this whole picture. But before that — there's a prerequisite for the class, which is that you need to be familiar with machine learning, or at least basic concepts in machine learning. Who here has played with PyTorch or TensorFlow, has trained a neural net? Okay, who has not done that? Don't be shy, okay? Okay. So the majority has, which is good. But I'm not going to assume that you know everything about this; in particular, I'm not going to assume that you know a lot of the deep underlying techniques. Okay, so here is the course plan, and depending on what you tell me, I can adjust this and go faster on certain sections that you think are too obvious because you've played with this before. So: intro to supervised learning, neural nets, and deep learning — that's what I'm going to talk about today: what deep learning can do, what it cannot do, what good features are. Deep learning is about learning representations. Next week will be about back-propagation and basic architectural components.
So things like the fact that you build neural nets out of modules, you connect them with each other, you compute gradients, you get automatic differentiation. And then there are various types of architectures, loss functions, activation functions, different modules, tricks like weight sharing and weight tying, multiplicative interactions, attention, gating, things like this, right? And then particular macro-architectures like mixtures of experts, hypernetworks, et cetera. So we'll dive in pretty quickly, and that's appropriate if you've already played with some of those things. Then there will be either one or two lectures — I haven't completely decided yet — about convolutional nets and their applications. One of them might end up being a guest lecture. Then, more specifically, deep learning architectures that are useful in special cases. So things like recurrent neural nets, with back-propagation through time, which is the way you train recurrent neural nets, and applications of recurrent nets to things like control, predicting time series, and stuff like that. Then things that combine recurrence with gating and multiplicative interactions, like gated recurrent units and LSTMs. And then things that really use multiplicative interactions as the basis of their architecture, like transformer networks, et cetera, which are very recent architectures that have become extremely popular in things like NLP and other areas. And then a little bit about graph neural nets — I'm not going to talk about those a lot, because there is another course you can take, by Joan Bruna, where he spends a lot of time on graph neural nets. Then we'll talk about how we get those deep learning systems to work, various tricks to get them to work — understanding the type of optimization that takes place in neural nets. Learning is almost always about optimization, and deep learning is almost always about gradient-based optimization. There are results about optimization in the convex case that are well understood, but they are not well understood when the training is stochastic, which is the case for most deep learning systems. And they are also not very well understood in deep learning because the cost function is not convex: it has local minima and saddle points and things like this. So it's important to understand the geometry of the objective function. I say it's important to understand, but the big secret here is that nobody actually understands. So it's important to understand, and nobody understands. But there are a few tricks that have come up through a combination of intuition, a little bit of theoretical analysis, and empirical search. Things like initialization tricks, normalization tricks, and regularization tricks like dropout; gradient clipping, which is more for optimization; things like momentum, averaged SGD, the various methods for parallelizing SGD, many of which do not work. And then something a little exotic called target prop, and the Lagrangian formulation of backprop. Then I'll switch to my favorite topic, which is energy-based models. This is a general formulation of a lot of different approaches to learning, whether they are supervised or unsupervised, and whether they involve things like inference — searching for the value of variables that nobody tells you the value of, but that your system is supposed to infer.
So that could be thought of as a way of implementing reasoning with neural nets. You could think of reasoning in neural nets as a process by which you have some energy function that's being optimized with respect to some variables, and the value you get as a result of this optimization is the value of those variables you were trying to find. There is the common view that a neural net is just a function that computes its output as a function of its input: you just run through the neural net, you get an output. But that's a fairly restrictive form of inference, in the sense that it can only produce one output for a given input. Very often there are multiple possible answers to a given input. So how do you represent problems of this type, where there are multiple possible answers to a given input? One answer is that you make those answers the minima of some energy function, and your inference algorithm finds values of those variables that minimize this energy function. There might be multiple minima, which means your model might produce multiple outputs for a given input, okay? So energy-based models are a way of doing this. A special case of those energy-based models are probabilistic models, like Bayesian methods, graphical models, Bayesian nets and things like this. Energy-based methods are a little more general, so a little less specific. Special cases of this also include things like what people used to call structured prediction. And then there are a lot of applications of this in what's called self-supervised learning, which will be the topic of the next couple of lectures. Self-supervised learning is a very, very active topic of research today, and probably something that's going to become really dominant in the future. Already, in the space of a year, it's become dominant in natural language processing. And in the last few months — just three months — there have been a few papers showing that self-supervised learning methods actually work really well in things like computer vision as well. So my guess is that self-supervised learning is going to take over the world in the next few years, and I think it's useful to hear about it in this class. I'm not going to go through a long list, but there are things you may have heard of, like variational autoencoders, denoising autoencoders, BERT — those transformer architectures for natural language processing, which are trained with self-supervised learning and are a special case of denoising autoencoders. A lot of those things you may have heard of without realizing they can all be understood in the context of this energy-based approach. That includes also generative adversarial networks, which I'm sure many of you have heard of. And then there is self-supervised learning and beyond. How do we get machines to really become intelligent? They're not very intelligent right now. They can solve very narrow problems very well, sometimes with superhuman performance, but no machine has any kind of common sense. The most intelligent machines that we have probably have less common sense than a house cat. So how do we get to cat-level intelligence first, and then maybe human-level intelligence?
And I don't pretend to have the answer, but there are a few ideas that are interesting to discuss in the context of self-supervised learning there. And perhaps some applications. Any questions? So that's the plan of the course, okay? It might change dynamically, but at least that's the intent. Any questions? Okay. "Are we having assignments in the course?" Yeah, there are assignments. Okay, so for those of you who didn't hear Alfredo, because he didn't speak very loudly: the final project is actually going to be a competition between the teams. So there's going to be a leaderboard and everything. And in preparation for this, the assignments will basically be practice to get familiar with all the techniques you'll need for deep learning in general, and for the final project in particular. Right. Also for the midterm, obviously. Okay, so most of you probably know this, and it's probably going to be boring for some of you who've already played with those things, but let's start from the basics. Deep learning is inspired by what people have observed about the brain, but the inspiration is just that — an inspiration. The attempt is not to copy the brain, because there are a lot of details about the brain that we don't know are actually relevant to intelligence. The inspiration is at a conceptual level. It's a little bit like airplanes being inspired by birds: the underlying principles of flight for birds and airplanes are essentially the same — they both have wings, they both generate lift by propelling themselves through the air — but airplanes don't have feathers and don't flap their wings. The details are extremely different. So it's a bit the same idea. And the history of this goes back to a field that has almost disappeared, or at least changed names, called cybernetics. If you want a specialist on the history of cybernetics, he's sitting right there — Joe LaMellin, raise your hand. Joe is actually a philosopher, and he has a seminar on the history of AI in — what department is this? Media, Culture and Communication. So not CS. And he knows everything about the history of cybernetics. So it started in the 40s with two gentlemen, McCulloch and Pitts — their picture is at the top right here. People at the time were interested in logic, and neuroscience was a very nascent field. And they got the idea that if neurons are basically threshold units that are on or off, then by connecting neurons with each other you can build Boolean circuits, and you can basically do logical inference with neurons. So they said, you know, the brain is basically a logical inference machine, because the neurons are binary. The idea was that a neuron computes a weighted sum of its inputs and then compares the weighted sum to a threshold: it turns on if the sum is above the threshold, and turns off if it's below. That's a simplified view of how real neurons work — a very, very simplified view. But that model stuck with the field for decades. Like, almost four decades. Actually, a full four decades.
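To make the McCulloch-Pitts picture concrete, here is a minimal sketch — mine, not from the lecture — of such a threshold unit. With suitable weights and thresholds, these units implement Boolean gates, which is exactly the logical-inference observation:

```python
import numpy as np

def threshold_unit(inputs, weights, threshold):
    """McCulloch-Pitts neuron: fires (1) if the weighted sum of its
    inputs reaches the threshold, stays off (0) otherwise."""
    return 1 if np.dot(weights, inputs) >= threshold else 0

x = np.array([1, 1])  # two binary inputs, both on
print(threshold_unit(x, np.array([1, 1]), threshold=2))  # AND gate -> 1
print(threshold_unit(x, np.array([1, 1]), threshold=1))  # OR gate  -> 1
```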
Then, quasi-simultaneously, there was Donald Hebb, who had the idea — it's an old idea — that the brain learns by modifying the strength of the connections between neurons, the synapses. He had the idea of what's now called Hebbian learning: if two neurons fire together, then the connection that links them increases, and if they don't fire together, maybe it decreases. That's not much of a learning algorithm, but it's a first idea, perhaps. And then cybernetics was proposed by this guy, Norbert Wiener, who is here. This is the whole idea that by having systems with sensors and actuators, you can have feedback loops, you can have self-regulating systems — and he worked out the theory behind this. We take that for granted now, but think of driving your car: you turn the steering wheel, and there's a so-called PID controller that turns the wheels of the car in proportion to how you turn the steering wheel. It's a feedback mechanism that measures the position of the steering wheel, measures the orientation of the wheels of the car, and if there is a difference between the two, corrects the wheels of the car so that they match the orientation of the steering wheel. That's a feedback mechanism. The stability of this, and the rules about it, all come, initially, from this work. That led a gentleman by the name of Frank Rosenblatt to imagine learning algorithms that modified the weights of very simple neural nets. What you see here at the bottom, the two pictures — this is Frank Rosenblatt, and this is the perceptron. This was a physical analog computer. It was not a three-line Python program, which is what it is now. It was a gigantic machine with wires and optical sensors. You could show it pictures — very low resolution. And it had neurons that could compute weighted sums, and the weights could be adapted: the weights were potentiometers, and the potentiometers had motors on them so the learning algorithm could rotate them. So it was electro-mechanical. What he's holding here in his hand is a module of eight weights — you can count them — with those motorized potentiometers on them. Okay, so that was a little bit of history of where neural nets come from. Another interesting piece of history is that this whole idea of trying to build intelligent machines by simulating networks of neurons was born in the 40s, took off a little bit in the late 50s, and completely died in the late 1960s, when people realized that with the kind of learning algorithms and architectures being proposed at the time, you couldn't do much. You could do basic, very simple pattern recognition, but not much more. So between roughly 1968-69 and 1984, I would say, basically nobody in the world was working on neural nets, except a few isolated researchers, mostly in Japan — Japan has its own relatively isolated ecosystem for funding research, so people there didn't follow the same fashions, if you want. And then the field took off again around 1985 with the emergence of back-propagation. Back-propagation is an algorithm for training multilayer neural nets, as many of you know. People were looking for something like this in the 60s and basically didn't find it. And the reason they didn't find it was that they had the wrong neurons: they were using McCulloch-Pitts neurons, which are binary.
And the way to get back-propagation to work is to use an activation function that is continuous and differentiable, or at least continuous. People just didn't have the idea of using continuous neurons, so they didn't think you could train those systems with gradient descent, because things were not differentiable. Now, there's another reason for this: if you have a neural net with binary neurons, you never need to compute multiplications. You never need to multiply two numbers; you only need to add numbers, right? If a neuron is active, you just add its weight to the weighted sum. If it's inactive, you don't do anything. If you have continuous neurons, you need to multiply the activation of a neuron by a weight to get its contribution to the weighted sum. And it turns out that before the 1980s, multiplying two numbers — particularly floating-point numbers — on any sort of non-ridiculously-expensive computer was extremely slow. So there was an incentive not to use continuous neurons, for that reason. The reason backprop didn't emerge earlier than the mid-80s is, pretty much, that that's when computers became fast enough to do floating-point multiplications. People didn't think of it this way at the time, but retrospectively, that's pretty much what happened. So there was a wave of interest in neural nets between 1985 and 1995 — it lasted about 10 years. In 1995, it died again. People in machine learning basically abandoned the idea of using neural nets, for reasons I'm not going to go into right now. And that lasted until the late 2000s, early 2010s. Around 2009, 2010, people realized that you could use multilayer neural nets, trained with backprop, and get an improvement in speech recognition. It didn't start with ImageNet — it started with speech recognition, around 2010. And within 18 months of the first papers being published on this, every major player in speech recognition had deployed commercial speech recognition systems that used neural nets. So if you had an Android phone and were using any of the speech recognition features on it around 2012, that used a neural net. That was probably the first really, really wide deployment of modern forms of deep learning, if you want. Then at the end of 2012, 2013, the same thing happened in computer vision: the computer vision community realized that deep learning — convolutional nets in particular — worked much better than whatever they were using before, switched to using convolutional nets, and basically abandoned all previous techniques. So that created a second revolution, in computer vision. And then about three years later, around 2015, 2016, the same thing happened in natural language processing — in language translation and things like this. And now, it hasn't happened yet, but we're going to see the same revolution occur in things like robotics and control and a whole bunch of other application areas. But let me get to this. Okay, so you all know what supervised learning is, I'm sure. And this is really what the vast majority — like 90-some percent — of applications of deep learning use as the main thing: supervised learning.
So supervised learning is the process by which you collect a bunch of pairs of inputs and outputs: say, images together with a category if you want to do image recognition, or a bunch of audio clips with their text transcriptions, or a bunch of text in one language with its translation in another language, et cetera. You feed one example to the machine, and it produces an output. If the output is correct, you don't do anything, or you don't do much. If the output is incorrect, then you tweak the parameters of the machine — think of it as a parameterized function of some kind — in such a way that the output gets closer to the one you want, okay? That, in non-technical terms, is what supervised learning is all about. Show it a picture of a car; if the system doesn't say "car", tweak the parameters. The parameters in a neural net are the weights that compute the weighted sums in those sets of simulated neurons. You tweak the knobs so that the output gets closer to the one you want. The trick in neural nets is figuring out in which direction, and by how much, to tweak the knobs so that the output gets closer to the one you want. That's what gradient computation and back-propagation are about. But before we get to this, a little bit of history again. There was a flurry of basic models for classification, starting with the perceptron. There was another competing model called the Adaline, at the top right here. They're based on the same basic architecture: compute a weighted sum of the inputs, compare it to a threshold; if it's above the threshold, turn on, if it's below, turn off. The Adaline you see here — the thing that Bernie Widrow is tweaking — is again a physical analog computer. It's like the perceptron, but much smaller in many ways. And the reason I'm telling you about this is that the perceptron was actually a two-layer neural net, in which the second layer was trainable, with adaptive weights, but the first layer was fixed. In fact, most of the time, in most experiments, it was determined randomly: you would randomly connect input pixels of an image to threshold neurons with random weights, essentially. This is what they called the associative layer. And that became the basis for the conceptual design of pattern recognition systems for the next four decades — I want to say four decades; yeah, pretty much. So that model is one where you take an input and run it through a feature extractor that is supposed to extract the relevant characteristics of the input that would be useful for the task. You want to recognize a face? Can you detect an eye? How do you detect an eye? Well, there is probably a dark circle somewhere — things like that, right? You want to recognize a car: are there dark round things, et cetera. What this feature extractor produces is a vector of features, which may be numbers or may be on or off, okay? So it's just a list of numbers, a vector. And you feed that vector to a trainable classifier — in the case of the perceptron, a simple neural net: just a system that computes a weighted sum and compares it to a threshold. And the problem is that you have to engineer the feature extractor.
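Schematically, that two-stage design looks something like this — a hedged sketch, where the particular features are placeholders I made up, not anything from the lecture:

```python
import numpy as np

def handcrafted_features(image):
    # The engineered stage: the designer decides what to measure.
    # (These particular features are arbitrary illustrations.)
    return np.array([image.mean(), image.std(), (image > 0.5).mean()])

def trainable_classifier(features, weights, threshold):
    # The perceptron-style stage: a weighted sum compared to a threshold.
    return 1 if np.dot(weights, features) >= threshold else 0
```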
So, the entire literature of pattern recognition — statistical pattern recognition at least, and a lot of computer vision, at least the part of computer vision interested in recognition — was focused on this first part, the feature extractor. How do you design a feature extractor for a particular problem? You want to do, I don't know, Hangul character recognition: what are the relevant features for recognizing Hangul, and how can you extract them using all kinds of algorithmic tricks? How do you preprocess the images? How do you normalize their size? How do you skeletonize the characters? How do you segment them from their background? The entire literature was devoted to this; very, very little was devoted to the classifier. And what deep learning brought to the table is the idea that instead of having this two-stage process for pattern recognition — where one stage is built by hand, where the representation of the input is the result of some hand-engineered program, essentially — you learn the entire task end to end. Okay, so basically you build your pattern recognition system, or whatever it is you want to build, as a cascade, a sequence of modules. All of those modules have tunable parameters, and all of them have some sort of non-linearity in them, okay? And you stack multiple layers of them, which is why it's called deep learning. The only reason for the word "deep" in deep learning is the fact that there are multiple layers. There's nothing more to it, okay? And then you train the entire thing end to end. The complication here, of course, is: for the parameters in the first box, how do you know in which direction to tweak them so that the final output gets closer to the one you want? That's what back-propagation does for you. Okay, why do all those modules have to be non-linear? Because if you have two successive modules and they're both linear, you can collapse them into a single linear module, right? The composition of two linear functions is a linear function. Take a vector, multiply it by a matrix, and then multiply the result by a second matrix: it's as if you had pre-computed the product of those two matrices and then multiplied the input vector by that composite matrix. So there's no point in having multiple layers if those layers are linear. (There is actually a point, but it's a minor one.) So, since they have to be non-linear: what is the simplest multilayer architecture you can imagine that has parameters you can tune — something like weights in a neural net — and is non-linear? You realize quickly that it has to look something like this. Take an input. An input can be represented as a vector, right? An image is just a list of numbers — think of it as a vector, and ignore the fact that it's an image for now. A piece of audio, whatever your sensors or your dataset gives you, is a vector. Multiply this vector by a matrix; the coefficients in this matrix are the tunable parameters. Then take the resulting vector — when you multiply a matrix by a vector, you get a vector — and pass each component of it through a non-linear function. And if you want the simplest possible non-linear function, use something like what's shown at the top here, which people in neural nets call the ReLU, people in engineering call half-wave rectification, and people in math call the positive part. Whatever you want to call it, okay?
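All three names refer to the same function, max(0, x), applied componentwise; in PyTorch:

```python
import torch

v = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(torch.relu(v))        # tensor([0.0, 0.0, 0.0, 1.5, 3.0])
# Equivalently v.clamp(min=0): the "positive part" of each component.
```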
So, apply this non-linear function to every component of the vector that results from multiplying the input vector by the matrix. Now you get a new vector, which has lots of zeros in it, because whenever the weighted sum was less than zero, the output is zero after you pass it through the ReLU. And then repeat the process: take that vector, multiply it by a weight matrix, pass the result through the point-wise non-linearity; take the result, multiply it by a matrix, pass the result through the non-linearity. That's a basic neural net, essentially. Okay? Now, why is that called a neural net at all? Because when you multiply a vector by a matrix, to compute each component of the output you actually compute a weighted sum of the components of the input, with the weights from the corresponding row of the matrix, right? So this little symbol here: there's a bunch of components of the vector coming into this layer, and you take a row of the matrix and compute a weighted sum of those values, where the weights are the values in that row of the matrix. You do this for every row, and that gives you the result. So the number of units after the multiplication by a matrix is equal to the number of rows of your matrix, and the number of columns of the matrix, of course, has to be equal to the size of the input. Okay, so supervised learning, in slightly more formal terms than the ones I used earlier, is the idea that you're going to compare the output the system produces with a target output. You show an input, you run it through the neural net, you get an output, and you compare this output with a target output using an objective function — a loss module that computes a distance, a discrepancy, a penalty, a divergence, whatever you want to call it; there are various names for it. The output of this cost function is a scalar: it computes, for example, the Euclidean distance between the target vector and the vector that the deep learning system produces. And then you compute the average of this cost function over a training set. A training set is composed of a bunch of pairs of inputs and outputs, and you average this scalar over the training set. The function you want to minimize, with respect to the parameters of the system — the tunable knobs — is that average. So you want to find the value of the parameters that minimizes the average error between the output you want and the output you get, averaged over a training set of samples. I'm sure the vast majority of people here have at least some intuitive understanding of what gradient descent is. Basically, the way to minimize this is to compute a gradient. It's like you're lost on a mountain — a very smooth mountain — and there is fog and it's night, and you want to get to the village in the valley. The best thing you can do is look around, see which way is down, and take a step down in the direction of steepest descent, okay? This search for the direction that goes down is called computing a gradient — or technically, a negative gradient. Then you take a step down: a step in the direction of the negative gradient.
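Here is a minimal PyTorch sketch of everything just described — stacked linear-plus-ReLU layers, a scalar loss averaged over a batch of input/target pairs, and one step against the gradient. The sizes and learning rate are arbitrary illustrations, not anything prescribed in the lecture:

```python
import torch
import torch.nn as nn

# Matrix multiply, pointwise ReLU, matrix multiply again.
model = nn.Sequential(
    nn.Linear(100, 30),   # a weight matrix with 30 rows and 100 columns
    nn.ReLU(),            # pointwise non-linearity
    nn.Linear(30, 10),
)
loss_fn = nn.MSELoss()    # e.g. squared Euclidean distance to the target

x = torch.randn(64, 100)  # a batch of 64 input vectors
y = torch.randn(64, 10)   # the corresponding target outputs

loss = loss_fn(model(x), y)   # scalar: average discrepancy over the batch
loss.backward()               # gradients of the loss w.r.t. all parameters

lr = 0.01
with torch.no_grad():
    for p in model.parameters():
        p -= lr * p.grad      # one small step along the negative gradient
```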
And if you keep doing this, and your steps are small enough — small enough that when you take a step, you don't jump to the other side of the mountain — then eventually you're going to converge to the valley, if the terrain is convex. Convex means there are no mountain lakes in the middle — no local minima where you'd get stuck even though the valley is lower, because you can't see it, okay? That's why convexity is important as a concept. But here is another concept, the concept of stochastic gradient, which again I'm sure a lot of you have heard of; I'll come back to it in more detail. The objective function you're computing is an average over many, many samples. You could compute this objective function and its gradient over the entire training set, by averaging over the entire training set. But it turns out to be more efficient to take just one sample, or a small group of samples, compute the error that this sample makes, compute the gradient of that error with respect to the parameters, and take a small step. Then a new sample comes in, and you get another value for the error and another value for the gradient — which may be in a different direction, because it's a different sample — and you take a step in that direction. If you keep doing this, you go down the cost surface, but in a noisy way: there are a lot of fluctuations. What is shown here is an example of this: stochastic gradient descent applied to a very simple two-dimensional problem, where you only have two weights. It looks semi-periodic because the examples are always shown in the same order, which is not what you're supposed to do with stochastic gradient. But as you can see, the path is really erratic. Why do people use this? There are various reasons. One reason is that, empirically, it converges much, much faster, particularly if you have a very large training set. The other reason is that you actually get better generalization in the end. If you measure the performance of the system on a separate set — I assume you all know the concepts of training set, test set, and validation set — you generally get better generalization with stochastic gradient than with true, full-batch gradient descent. The problem is — yes? [A student asks whether, if it were computationally possible, gradient descent on the entire dataset would be better.] No, it's worse. Computing the gradient over the entire dataset is computationally feasible — I mean, you can do it, it's not any more expensive. And yes, it would be less noisy — it would be less noisy, but it would be slower. Let me tell you why; this is something we'll talk about again when we discuss optimization. Say I give you a training set with a million training samples, and it's actually a hundred repetitions of the same set of 10,000 samples, okay? So my actual data is 10,000 training samples; I replicate it 100 times and claim that I scrambled it. I tell you: here is my training set with a million training samples. If you do a full gradient, you're going to compute the same values 100 times. You're going to spend 100 times more work than necessary, without knowing it, okay?
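In code, the noisy procedure is just the standard minibatch SGD loop — a sketch reusing the model and loss from the previous sketch, with dummy data standing in for a real training set. Note the shuffle: the semi-periodic plot mentioned above came from showing examples in a fixed order:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10_000, 100), torch.randn(10_000, 10))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for x_batch, y_batch in loader:   # one noisy gradient step per minibatch
    optimizer.zero_grad()
    loss = loss_fn(model(x_batch), y_batch)
    loss.backward()
    optimizer.step()
```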
That example is extreme, of course, but the same thing happens in more normal machine learning situations, where samples have a lot of redundancy in them — many samples are very similar to each other, et cetera. If there is any kind of coherence, if your system is capable of generalization, then stochastic gradient is going to be more efficient, because if you don't use stochastic gradient, you're not going to be able to take advantage of that redundancy. So that's one case where noise is good for you. Okay, don't pay attention to the formula — don't get scared — because we're going to come back to this in more detail. But why is back-propagation called back-propagation? Again, very informally: it's basically a practical application of the chain rule. You can think of a neural net of the type I showed earlier as a bunch of modules stacked on top of each other, and you can think of this as a composition of functions. You all know the basic rule of calculus for computing the derivative of a function composed with another function: the derivative of f composed with g, at a point x, is the derivative of f at the point g(x), multiplied by the derivative of g at the point x, right? You get the product of the two derivatives. This is the same thing, except that the functions, instead of being scalar functions, are vector functions: they take vectors as inputs and produce vectors as outputs. More generally, they take multidimensional arrays as inputs and produce multidimensional arrays as outputs — it doesn't matter. What is the generalization of this chain rule to functional modules with multiple inputs and multiple outputs, viewed as functions? It's basically the same rule, applied blindly, as for regular derivatives, except that you have to use partial derivatives. What you see in the end is that if you want to compute the derivative of the discrepancy between the output you want and the output you get — the value of your objective function — with respect to any variable inside the network, you have to propagate derivatives backwards, multiplying things along the way, all right? We'll be much more formal about this next week. For now, you just know why it's called back-propagation, and that it applies to multiple layers. Okay, so the picture I showed earlier of this neural net is nice, but what if the input is actually an image? An image — even a relatively low-resolution image — is typically a few hundred pixels on a side. Let's say 256 by 256, to take a random example: a color image, 256 by 256. That's 65,536 pixels, times three, because you have R, G and B components — three values for each pixel. So that's roughly 200,000 values, okay? Your input vector has 200,000 components. If you have a matrix that is going to multiply this vector, this matrix has to have 200,000 rows — columns, I'm sorry. And depending on how many units you have in the first layer, it's going to be 200,000 by some possibly large number. That's a huge matrix, right? Even if it's 200,000 by 100,000 — so you have a compression in the first layer — that's already a very, very large matrix: billions of entries. So it's not really practical to think of this as a full matrix, right?
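The arithmetic, spelled out (the hidden-layer size here is hypothetical, just to show the scale):

```python
height, width, channels = 256, 256, 3
input_size = height * width * channels   # 196,608: roughly 200,000 values

hidden_units = 100_000                   # even with "compression" in layer 1
full_matrix_weights = hidden_units * input_size
print(f"{full_matrix_weights:,}")        # 19,660,800,000: tens of billions
```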
What you're going to have to do, if you want to deal with things like images, is make some hypothesis about the structure of this matrix, so that it's not a completely full matrix that connects everything to everything. That would be impractical, at least for a lot of practical applications. And this is where inspiration from the brain comes back. There was classic work in neuroscience in the 1960s by the gentlemen at the top here, Hubel and Wiesel. They won the Nobel Prize for this in 1981, but their work was from the late 50s and the 60s. What they did was poke electrodes into the visual cortex of various animals — cats, monkeys, mice, whatever; I think they liked cats a lot — and try to figure out what the neurons in the visual cortex were doing. And here is what they discovered. First of all — well, this diagram is of a human brain and came much later, but all mammalian visual systems are organized in a similar way. You have signals coming into your eyes, striking your retina. You have a few layers of neurons in your retina, in front of your photoreceptors, that pre-process the signal, if you want. They compress it, because the human eye has something like 100 million photoreceptors — it's like a 100-megapixel camera. But you cannot have 100 million fibers coming out of your eye, because otherwise your optic nerve would be this big and you wouldn't be able to move your eyes. So those neurons in front of your retina do compression. They don't do JPEG compression, but they compress the signal down to one million fibers. You have one million fibers coming out of each of your eyes, which makes your optic nerve about this big, which means you can carry the signal and still turn your eyes. This is actually a mistake that evolution made for vertebrates; invertebrates are not like that. It's a big mistake because the neurons that process the signal are in front of your retina, so the wires collecting the information have to be in front of your retina too, blocking part of the view, if you want. And then they have to punch a hole through your retina to get to your brain. So there's a blind spot in your visual field, because that's where your optic nerve punches through your retina. It's kind of ridiculous: if you have a camera like this, having the wires come out the front and then digging a hole in your sensor to route the wires back. It's much better if the wires come out the back, right? Vertebrates got that wrong; invertebrates got it right. Squids and octopuses actually have the wires coming out the back. They're much luckier. But anyway — the signal goes from your eyes to a little piece of brain called the lateral geniculate nucleus, which is under your brain, at the base of it, and it does a little bit of contrast normalization. We'll talk about this again in a few lectures. And then that goes to the back of your brain, where the primary visual cortex is — the area called V1, in humans. And there's something called the ventral hierarchy — V1, V2, V4, IT — a series of brain areas going from the back to the side. And in the inferotemporal cortex, right here, is where object categories are represented.
So when you go around and you see your grandmother, you have a bunch of neurons firing in this area that represent your grandmother. And it doesn't matter what your grandmother is wearing, what position she is in, whether there is occlusion or whatever — those neurons will fire if you see your grandmother. So they represent category-level things. And those things have been discovered through experiments with patients who had to have their skulls open for a few weeks, where people poked electrodes in and had them watch movies, and realized: this is a neuron that turns on if Jennifer Aniston is in the movie. And it only turns on for Jennifer Aniston. So there's the idea that somehow the visual cortex can do pattern recognition and seems to have this hierarchical, multi-layer structure. There's also the idea that the visual process is essentially a feed-forward process. The process by which you recognize everyday objects is very fast: it takes about 100 milliseconds, and that's barely enough time for the signal to go from your retina to the inferotemporal cortex. There are a few milliseconds of delay per neuron that the signal has to go through; in 100 milliseconds, you barely have time for a few spikes to go through the entire system. So there's no time for recurrent connections, et cetera. That doesn't mean there are no recurrent connections — there are tons of them — but somehow fast recognition is done without them. This is called the feed-forward ventral pathway. And this gentleman here, Kunihiko Fukushima, had the idea in the 70s of taking inspiration from Hubel and Wiesel and building a neural net model on the computer that had layers, but also the other idea Hubel and Wiesel discovered: that individual neurons only react to a small part of the visual field. They poked electrodes into neurons in V1 and realized that a given neuron in V1 only reacts to motifs that appear in a very small area of the visual field, and the neuron next to it reacts to another area next to the first one. So the neurons seem to be organized in what's called a retinotopic way, which means that neighboring neurons react to neighboring regions of the visual field. What they also realized is that there are groups of neurons that all react to the same area of the visual field, and they seem to turn on for edges at particular orientations. One neuron will turn on if its receptive field contains a vertical edge, the one next to it if the edge is a little slanted, the one next to that if the edge is rotated a little more, et cetera. So they had this picture of V1 as basically orientation selectivity: neurons that look at a local field and react to orientations. And those groups of neurons that react to multiple orientations are replicated over the entire visual field. So this guy Fukushima said: well, why don't I build a neural net that does this? I'm not necessarily going to insist that my system extract oriented features, but I'm going to use some sort of unsupervised learning algorithm to train it. So he was not training his system end to end; he was training it layer by layer, in some sort of unsupervised fashion, which I'm not going to go into the details of. And he used the concept that those neurons are replicated across the visual field. And then he used another concept from Hubel and Wiesel, called complex cells.
Complex cells are units that pool the activities of a bunch of simple cells — those orientation-selective units. They pool them in such a way that if an oriented edge is moved a little bit, it will activate different simple cells, but the complex cell, since it integrates the outputs of all those simple cells, will stay activated until the edge moves beyond its receptive field. So those complex cells build a little bit of shift invariance into the representation: you can shift an edge a little bit and it will not change the activity of one of those complex cells. That's what we now call pooling in the context of convolutional nets. And that basically is what led me, in the mid-80s and late-80s, to come up with convolutional nets. These are networks where the connections are local and replicated across the visual field, and where you intersperse feature-detection layers that detect those local features with pooling operations. We'll talk about this at length in three weeks, so I'm not going to go into every detail, but it recycles this idea from Hubel and Wiesel and Fukushima that — if I can get my pointer — basically every neuron in one layer computes a weighted sum over a small area of the input, and those weights are replicated across locations, so every neuron in a layer uses the same set of weights. This is the idea of weight tying, or weight sharing. So using backprop, we were able to train neural nets like this to recognize handwritten digits. This is from the late 80s, early 90s. And this is me when I was about your age, maybe a little older — I'm about 30 in that video. And this is my phone number when I was working at Bell Labs; it doesn't work anymore, it's a New Jersey number. I hit a key, and this is a neural net running on a 386 PC with a special accelerator card, recognizing those characters — a neural net very similar to the one I just showed you the animation of. And this thing could recognize characters of any style, including very strange styles, including even stranger styles. This was kind of new at the time, because this was back when character recognition — pattern recognition in general — was still in the mode of: extract features, then train a classifier on top. And this could learn the entire task end to end. Basically, the first few layers of that neural net play the role of a feature extractor, but trained from data. The reason we used character recognition is that it was the only thing for which we had data. The only tasks for which there was enough data were character recognition and speech recognition; the speech recognition experiments were somewhat successful, but not as much. Pretty quickly, we realized we could use those convolutional nets not just to recognize individual characters, but to recognize groups of characters — multiple characters at a time. And it's the convolutional nature of the network, which I'll come back to in three lectures, that basically allows those systems to be applied to a large image: they will just turn on wherever they see a shape they can recognize in their field of view.
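Putting those ingredients together — local, weight-shared feature detection interspersed with pooling — here is a minimal convolutional net sketch in the spirit of those early digit recognizers. The layer sizes are mine, assuming 28-by-28 grayscale input; this is not the actual 1980s architecture:

```python
import torch.nn as nn

convnet = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5),  # same 5x5 weights at every location
    nn.ReLU(),
    nn.MaxPool2d(2),                  # pooling: local shift invariance
    nn.Conv2d(16, 32, kernel_size=5),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                     # 28x28 input -> 32 maps of 4x4 here
    nn.Linear(32 * 4 * 4, 10),        # e.g. 10 digit classes
)
```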
So basically, if you have a large image, you train a convolutional net that has a small input window and you swipe it over the entire image, and whenever it turns on, that means it's detected the object you trained it to detect. So here the system is capable of doing simultaneous segmentation and recognition. Before that, people in pattern recognition would have an explicit program that would separate individual objects from their background and from each other, and then send each individual object — each character, for example — to a recognizer. With this, you can do both at the same time: you don't have to worry about it, you don't have to build any special program for it. In particular, this can be applied to natural images for things like face detection, pedestrian detection, things like this, right? Same thing: train a convolutional net to distinguish between an image with a face and an image without a face — train it with several thousand examples — then take that window, swipe it over an image, and wherever it turns on, there is a face. Of course, the face could be bigger than the window, so you subsample the image to make it smaller and swipe your network again, then make it smaller again and swipe again, so that you can detect faces regardless of size. Okay, in particular you can use this to drive robots. These are things that were done before deep learning became popular, okay? This is an example where the network is a convolutional net applied to the image coming from the camera of a running robot, and it's trying to classify every small window — like 40 by 40 pixels, even less — as to whether the central pixel in that window is on the ground or is an obstacle. Whatever it classifies as being on the ground is green; whatever it classifies as being an obstacle is red, or purple if it's at the foot of an obstacle. Then you can map this to the map you see at the top, do planning in that map to reach a particular goal, and use this to navigate. The two former PhD students here — Raia Hadsell on the right, Pierre Sermanet on the left — are annoying this poor robot. They're pretty confident the robot is not going to break their legs, since they actually wrote the code and trained it. Pierre Sermanet is now a research scientist at Google Brain in California, working on robotics. Raia Hadsell is Director of Robotics Research at DeepMind. They did pretty well. A similar idea can be used for what's called semantic segmentation. Semantic segmentation is the idea that, again with this kind of sliding-window approach, you train a convolutional net to classify the central pixel of a window, using the window as context — but here it's not just trained to classify obstacles from non-obstacles, it's trained to classify something like 30 categories. This is down Washington Place, I think, and this is Washington Square Park. It knows about roads and people and plants and trees and whatever — but it finds desert in the middle of Washington Square Park, where there's no beach that I'm aware of. So it's not perfect. At the time, though, it was state of the art: that was the best system there was for this kind of semantic segmentation. I was running around giving talks, trying to evangelize people about deep learning back then. This was around 2010, before the deep learning revolution, if you want.
And one person, a professor from Israel, was sitting in one of my talks. He's a theoretician, but he was really transfixed by the potential applications of this, and he was just about to take a sabbatical and work for a company called Mobileye, which was a startup in Israel at the time, working on autonomous driving. A couple of months after he heard my talk, he started working at Mobileye, and he told the Mobileye people: you should try this convolutional net stuff, it works really well. And the engineers said: nah, we don't believe in that stuff, we have our own methods. So he implemented it and tried it himself, and beat the hell out of all the benchmarks they had. All of a sudden, the whole company switched to using convolutional nets. And they were the first company to come up with a vision system for cars that can keep a car on a highway and can brake if there is a pedestrian or a cyclist crossing. I'll come back to this in a minute. They were basically using this technique, semantic segmentation, very similar to the one I showed for the robot before. This was a guy by the name of Shai Shalev-Shwartz. You have to be aware also that back in the 80s, people were really interested in building special types of hardware that could run neural nets really fast. These are a few examples of neural net chips that were actually built — I had something to do with some of them; they were built by people working in the same group as me at Bell Labs in New Jersey. This was a hot topic in the 1980s. And then, of course, with the interest in neural nets dying in the mid-90s, people stopped working on this — until a few years ago. Now the hottest topic in chip design, in the chip industry, is neural net accelerators. You go to any conference on computer architecture or chips, like ISSCC, which is the big solid-state circuits conference, and half the talks are about neural net accelerators. And I worked on a few of those things. Okay, so then something happened, as I told you, around 2010, 2013, 2015 — in speech recognition, image recognition, natural language processing — and it's continuing; we're in the middle of it now for other topics. What happened — and I'm really sad to say it didn't happen in my lab, but with our friends — started with Yoshua Bengio, Geoff Hinton and me back in the early 2000s. We knew that deep learning was working really well, and we knew that the whole community was making a mistake by dismissing neural nets and deep learning — we didn't use the term deep learning yet; we invented it a few years later. So around 2003, 2004, we started kind of a conspiracy, if you want. We got together and said: we're going to try to beat some records on some data sets, invent new algorithms that allow us to train very large neural nets, collect very large data sets, and show the world that those things really work, because nobody believed it. And that succeeded beyond our wildest dreams. In particular, in 2012, Geoff Hinton had a student, Alex Krizhevsky, who spent a lot of time implementing convolutional nets on GPUs, which were kind of new at the time. They were not entirely new, but they were starting to become really high-performance, and he was very good at hacking that. So they were able to train much larger neural nets — convolutional nets — than anybody had been able to before.
And so they used it to train on the ImageNet dataset. The ImageNet dataset is a bunch of natural photos, and the system is supposed to recognize the main object in each photo among 1000 different categories. The training set had 1.3 million samples, which is kind of large. So what they did was build a really large and very deep convolutional net — very much on the same model as what we had before — implemented on GPUs, and let it run for a couple of weeks. And with that, they beat the performance of the best competing systems by a large margin. This is the error rate on ImageNet, going back to 2010. In 2010 it was about 28% error, top-5 — meaning you get an error if the correct category is not in your top five guesses among the 1000, okay? So it's kind of a mild measure of error. In 2011 it was 25.8%. The system that did this was very, very large — it was somewhat convolutional-net-like, but it wasn't trained; I mean, only the last layer was trained. And then Geoff and his team got it down to 16.4%. That was a watershed moment for the computer vision community. A lot of people said: okay, now we know this thing works. And the whole community went from rejecting basically every paper that had neural nets in it in 2011 and 2012, to rejecting every paper that does not have a convolutional net in it in 2016. So now it's the new religion, right? You can't get a paper into a computer vision conference unless you use a convolutional net somehow. And the error rate went down really quickly. People found all kinds of really cute architectural tricks that made those things work better. And what you've seen is an inflation in the number of layers. My convolutional nets from the 90s and early 2000s had seven layers or so. AlexNet had, I don't know, 12. Then VGG, the year after that, had 19. GoogLeNet had — I don't know how many, because it's hard to figure out how you count. And the workhorse of object recognition now, the standard backbone as people call it, has 50 layers — it's called ResNet-50. But some networks have 100 layers or so. Alfredo, a few years ago, put together this chart, where each of those blobs is a particular network architecture. The x-axis is the number of billions of operations you need to compute the output — those things are really big, billions of connections. The y-axis is the top-1 accuracy on ImageNet, so not the same measure of performance as the one I showed you before; the best systems are around 84 today. And the size of the blob is the memory occupancy: the number of millions of floats you need to store the weight values. Now, people are very smart about compressing those things, quantizing them — there are entire teams at Google, Facebook and various other places that work only on optimizing those networks and compressing them so they run fast. Because, to give you a rough idea, the number of times Facebook, for example, runs a convolutional net on its servers per day is in the tens of billions. Okay, so there is a huge incentive to optimize the amount of computation necessary for this. One reason convolutional nets are so successful is that they exploit a property of natural data, which is compositionality. Compositionality is the property by which a scene is composed of objects, objects are composed of parts, parts are composed of subparts.
So one reason why convolutional nets are so successful is that they exploit a property of natural data, which is compositionality. Compositionality is the property by which a scene is composed of objects, objects are composed of parts, parts are composed of subparts, subparts are combinations of motifs, and motifs are combinations of contours or edges, or textures. And those are just combinations of pixels. Okay, so there's this so-called compositional hierarchy: particular combinations of objects at one level of the hierarchy form the objects at the next level. And if you mimic this compositional hierarchy in the architecture of the network, and you let it learn the appropriate combinations of features at one layer that form the features of the next layer, that's really what deep learning is. Okay, learning to represent the world and exploit its structure, the structure being the fact that there is organization in the world, because the world is compositional. A statistician by the name of Stuart Geman, who is at Brown University, was playing on the famous Einstein quote. Einstein said the most incomprehensible thing about the world is that the world is comprehensible. Among all the things the world could be, it could be extremely complicated, so complicated that we'd have no way of understanding it, and it looks like a conspiracy that we're able to understand at least part of it. And Stuart Geman's version of this is that the world is compositional, or there is a God, because you would need something supernatural to be able to understand it if the world were not compositional. So this has led to incredible progress in things like computer vision. We went from barely being able to detect people, unreliably, to being able to generate accurate masks for every object, and then even figure out the pose, and do this in real time on a mobile platform. I mean, the progress has been nothing short of incredible. And most of those things are based on two basic families of architectures: the so-called one-pass object detection and recognition architectures, called RetinaNet or feature pyramid networks, there are various names for them, or U-Net; and then another type, Mask R-CNN. Both of them actually originated from Facebook, or the people who originated them are now at Facebook; they sometimes came up with them before they came to Facebook. But those things work really well. They can do things like detect objects that are partially occluded, and draw a mask of every object. So basically this is a neural net, a convolutional net, where the input is an image but the output is also an image. In fact, the output is a whole bunch of images, one per category, and for each category it outputs the mask of the objects from that category. Those things can also do what's called instance segmentation. So if you have a whole bunch of sheep, it can tell you not just that this region is sheep; it will actually pick out the individual sheep, tell them apart, and count the sheep, right? And to fall asleep, that's what you're supposed to do, right? To fall asleep, you count sheep.
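Here's a minimal sketch of that image-in, one-score-map-per-category-out idea. The layer sizes, the 21-category count, and the TinySegNet name are made up for illustration; this is not any particular published architecture.

```python
import torch
import torch.nn as nn

# a tiny fully-convolutional net: the input is an image, the output is
# a whole bunch of images, one score map ("mask image") per category.
class TinySegNet(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Conv2d(64, num_classes, 1)  # one output map per category

    def forward(self, x):
        return self.classifier(self.features(x))

net = TinySegNet()
img = torch.randn(1, 3, 128, 128)
masks = net(img)            # (1, 21, 128, 128): one score map per category
labels = masks.argmax(1)    # per-pixel predicted category
```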
And the cool thing about deep learning is that a lot of the community has embraced the idea that research has to be done in the open. So a lot of the stuff we're going to be talking about in the class is not just published, it's published with code. And it's not just code, it's actually pre-trained models that you can just download and run, all open source, all free to use. That's really new. I mean, people didn't use to do research this way, particularly in industry, but even in academia people weren't used to distributing their code. But with deep learning, somehow the race has driven people to be more open about research. So there are a lot of applications of all this. As I said, self-driving cars. This is actually a video from Mobileye, and Mobileye was pretty early in using convolutional nets for autonomous driving, to the point that in 2015 they had managed to shoehorn a convolutional net onto a chip they had designed for some other purpose. And they licensed the technology to Tesla. So the first self-driving Teslas, I mean, not really self-driving, it's driving assistance, right? They can keep in lane on the highway and change lanes. They had this Mobileye system, and that's pretty cool. It's a convolutional net on a little chip that looks out the windshield from behind the rear-view mirror. That was four or five years ago. Since then, this kind of technology has been very widely deployed by a lot of different companies. Mobileye, which was bought by Intel, still has something like 70 or 80% of the market for those vision systems, but there are a lot of companies and car manufacturers that use those things. In fact, in some European countries, every single car that comes out, even low-end cars, has a convolutional-net-based vision system. They call this the advanced emergency braking system, or automated emergency braking system, AEBS. It's deployed in every new car in France, for example, and it reduces collisions by 40%. Not every car on the road has it yet, because people keep their cars for a long time. But what that means is that it saves lives. So that's a very positive application of deep learning. Another big category of applications, of course, is medical imaging. Probably the hottest topic in radiology these days is how to use AI, which means convolutional nets, for radiology. This is lifted from a paper by some of our colleagues here at NYU, where they analyzed MRI images. One big advantage of convolutional nets is that they don't need to look at an MRI on a screen. In particular, they don't have to slice it into 2D images; they can look at the entire 3D volume. That's the property this system uses: it's a 3D convolutional net that looks at the entire volume of an MRI image, and then it uses a technique very similar to the semantic segmentation I was showing before. It basically turns on, in the output image, wherever there is, here, a femur bone. So this is the kind of result it produces, and it works better in 3D than on 2D slices. Or it can turn on when it detects a tumor in mammograms; that one is 2D, not 3D. And there are various other projects in medical imaging going on. And there are lots of applications in science, in physics, in bioinformatics, you name it, which we'll come back to.
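To show what looking at the whole volume means in code, here's a minimal 3D convolutional sketch. The channel counts, the volume size, and the two-class femur-versus-background setup are assumptions for illustration, not the NYU model.

```python
import torch
import torch.nn as nn

# a per-voxel segmentation sketch: the net sees the entire 3D volume at once
# instead of 2D slices; sizes here are hypothetical toy values.
seg3d = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv3d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv3d(16, 2, kernel_size=1),   # per-voxel scores: femur vs. background
)

volume = torch.randn(1, 1, 64, 64, 64)  # (batch, channel, depth, height, width)
scores = seg3d(volume)                   # (1, 2, 64, 64, 64): one score volume per class
mask = scores.argmax(1)                  # per-voxel segmentation
```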
Okay. So there are a bunch of mysteries in deep learning. They're not complete mysteries, because people have some understanding of all this, but they are mysteries in the sense that we don't have a nice theory for everything. Why does it work so well? One big question that theoreticians were asking many years ago, when I was trying to convince the world that deep learning was a good idea, was: you can approximate any function with just two layers, so why do you need more? I'll come back to this in a minute. What's so special about convolutional nets? I talked about the compositionality of natural images, and of natural data in general; it's true for speech and various other natural signals too, but it seems a little contrived. How is it that we can train these systems despite the fact that the objective function we're minimizing is very non-convex and may have lots of local minima? This was a big criticism that people were throwing at neural nets back in the old days, mostly people who had never played with neural nets. They'd say, you have no guarantee that your algorithm will converge; this is too scary, I'm not going to use it. And the last one is: why is it that the way we train neural nets breaks everything that every textbook in statistics tells you? Every textbook in statistics tells you that if you have n data points, you shouldn't have more than n parameters, because you're going to overfit like crazy. You might regularize; if you are a Bayesian, you might throw in a prior, which is equivalent. But what guarantee do you have? And neural nets are wildly over-parameterized. We routinely train neural nets with hundreds of millions of parameters, they're used in production, and the number of training samples is nowhere near that. How does that work? But it works. OK, so, things we can do with deep learning today: we can have safer cars; we can have better medical image analysis systems; we can have pretty good language translation, far from perfect but useful; stupid chatbots; very good information search, retrieval and filtering. Google and Facebook nowadays are completely built around deep learning; you take deep learning out of them and they crumble. And there are lots of applications in energy management and production, manufacturing, environmental protection, all kinds of stuff. But we don't have really intelligent machines. We don't have machines with common sense. We don't have intelligent personal assistants, or smart chatbots, or household robots. There are a lot of things we don't know how to do, which is why we still do research. OK, so deep learning is really about learning representations. But then we should ask what good representations are. I talked about the traditional model of pattern recognition. A representation is about taking your raw data and turning it into a form that is useful somehow. Ideally, you'd like to turn it into a form that's useful regardless of what you want to do with it, useful in a general way, and it's not entirely clear what that means. But at least you want to turn it into a representation that's useful for the tasks you are envisioning. And there have been a lot of ideas over the decades on general ways to preprocess natural data so as to produce good representations of it. I'll go through some of this laundry list. There are things like tiling the space, or doing random projections.
So, random projections. This is actually kind of a monster that rears its head periodically, like every five years, and you have to whack it on the head every time it pops up. That was the idea behind the perceptron: the first layer of the perceptron is a layer of random projections. What does that mean? A random projection is a random matrix, with a smaller output dimension than input dimension, with some sort of nonlinearity at the end. So think of a single-layer neural net with nonlinearities, but where the weights are random. A lot of people keep rediscovering that wheel, claiming it's great because you don't have to do multi-layer training. So it started with the perceptron, and then it came back in the 60s, and then it came back again in the 1980s, and then it came back again, and now it's back once more. There's a whole community, mostly in Asia, that calls two-layer neural nets where the first layer is random "extreme learning machines." It's ridiculous, but it exists. They're not extreme. I mean, they're extremely stupid, but you know.
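Here's a minimal sketch of that recipe, under made-up toy sizes: a fixed random first layer plus a nonlinearity, and only the linear output layer gets trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# "extreme learning machine" flavor of random projections: the first layer
# is a random matrix that is never trained; only the output layer is fit.
n, d_in, d_hidden = 200, 10, 100                 # hypothetical toy sizes
X = rng.normal(size=(n, d_in))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)   # toy regression target

W = rng.normal(size=(d_in, d_hidden))            # random weights, fixed forever
H = np.tanh(X @ W)                               # random nonlinear features
w_out, *_ = np.linalg.lstsq(H, y, rcond=None)    # train only the second layer

y_hat = H @ w_out
print(np.mean((y_hat - y) ** 2))                 # training error of the fit
```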
Right, so I was mentioning the compositionality of the world: from pixels to edges to textures and motifs, to parts, to objects. In text, you have characters, words, word groups, clauses, sentences, stories. In speech, it's the same: you have individual samples, spectral bands, sounds, phonemes, words, et cetera. You always have this kind of hierarchy. OK, so there have been many attempts at dismissing the whole idea of deep learning, and I've heard them for decades, mostly from theoreticians, but from a lot of other people too. And you have to know about them, because they're going to come back in five years when people say, oh, deep learning sucks, why not use support vector machines? So here are support vector machines, on the top left. I'm sure many of you have heard about kernel machines and support vector machines. Who knows what this is, even roughly? A few hands. Who has no idea what a support vector machine is? Don't be shy, it's OK if you don't; most people haven't raised their hands for either. OK, come on, all the way up. Cool, all right. So here's a support vector machine. A support vector machine is a two-layer neural net. Well, it's not really a neural net, and people don't like it when it's formulated this way, but you really can think of it this way. It's a two-layer neural net where the first layer is symbolized by this function k here: each unit in the first layer compares the input vector x to one of the training samples, the xi's. So take your training samples, say you have 1,000 of them, so you have 1,000 xi's, for i equal 1 to 1,000. And you have some function k that compares x and xi. A good example of such a function: you take the distance between x and xi and pass it through an exponential of minus the square, so you get a Gaussian response as a function of the distance between x and xi. It's a way of comparing two vectors; it doesn't really matter what it is. And then you take those scores coming out of this k function, which compares the input to every training sample, and you compute a weighted sum of them. And what you learn are the weights, the alphas. OK, so it's a two-layer neural net in which the second layer is trainable. The first layer is fixed, but in a way you can think of it as being trained in an unsupervised manner, because it uses the data from the training set, but only the x's, not the y's. And it uses the data in the stupidest way you can imagine: you store every x and use each one as the weights of a unit, if you want. That's what a support vector machine is. You can write 1,000-page books about the cute mathematics behind it, but the bottom line is that it's a two-layer neural net where the first layer is trained in a very stupid unsupervised way and the second layer is just a linear classifier. So it's basically glorified template matching, because it compares the input vector to all the training samples. And it doesn't work if you want to do computer vision with raw images. If x is an image and the xi's are a million images from ImageNet, then first of all, for every image you're going to have to compare it with a million images, or maybe a little fewer if you're smart about how you train it, and that's going to be very expensive. And the kind of comparison you're making is really what solves the problem; the weighted sum you compute at the end is the cherry on the cake. I use that analogy too often, actually. Now, there are theorems showing that you can approximate any function you want, as closely as you want, by tuning the k function and the alphas. So if you talk to a theoretician, they'll tell you: why do you need deep learning? I can approximate any function with a kernel machine. But the number of terms in that sum can be very large, and nobody tells you what kernel function to use, so that doesn't solve the problem. Or you can use a two-layer neural net; that's the top right here. The first layer is a nonlinear function f applied to the product of a matrix W0 with the input vector; then the second layer multiplies the result by a second matrix and passes it through another nonlinearity. So it's a composition of two linear and nonlinear operations. Again, you can show that under some conditions you can approximate any function you want with something like this, given a large enough vector in the middle. If the dimension of what comes out of the first layer is high enough, potentially infinite, you can approximate any function as closely as you want. So again, you talk to theoreticians and they tell you: why do you need more layers? I can approximate anything I want with two.
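Here's a minimal sketch of that kernel-machine form: a Gaussian first layer that compares the input to every training sample, and a trained linear second layer. The toy 1-D regression task, the gamma value, and the ridge term are assumptions for illustration; solving for the alphas this way is kernel ridge regression rather than a max-margin SVM, but the two-layer structure is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# first layer: compare x to every training sample xi with a Gaussian kernel;
# second layer: a weighted sum of those scores, where the alphas are learned.
X_train = rng.uniform(-3, 3, size=(50, 1))
y_train = np.sin(X_train[:, 0])

def k(x, xi, gamma=2.0):
    # Gaussian response as a function of the distance between x and xi
    return np.exp(-gamma * np.sum((x - xi) ** 2, axis=-1))

K = k(X_train[:, None, :], X_train[None, :, :])          # (50, 50) first-layer outputs
alpha = np.linalg.solve(K + 1e-6 * np.eye(50), y_train)  # learn the alphas (ridge)

x_test = np.array([[0.5]])
y_hat = k(x_test[:, None, :], X_train[None, :, :]) @ alpha
print(y_hat, np.sin(0.5))   # the prediction lands close to the true value
```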
But there is an argument, which is that it can be very, very expensive to do it in two layers. For some of you, this may sound familiar; for most of you, probably not. Let's say I want to design a logic circuit. When you design logic circuits, you have AND gates and OR gates, or NAND gates, negated ANDs; you can do everything with just NANDs. And you can show that any Boolean function can be written as a bunch of ANDs with an OR on top. That's called disjunctive normal form, DNF. So any Boolean function can be written in two layers. The problem is that for most functions, the number of terms you need in the middle is exponential in the size of the input. For example, say I give you N bits and ask you to construct a circuit that tells you whether the number of bits that are on in the input string is even or odd. It's a simple Boolean function, 1 or 0 on the output. The number of gates you need in the middle is essentially exponential in N if you do it in two layers. If you allow yourself log N layers, where N is the number of input bits, then it's linear in N. So you go from exponential complexity to linear complexity if you allow yourself to use multiple layers. It's as if, when you write a program, I told you to write it in such a way that only two sequential steps are necessary to run it. You can run as many instructions as you want, but most of them have to run in parallel, and you're only allowed two sequential steps. And the kind of instructions you have access to are simple things like linear combinations and nonlinearities, not entire sub-programs. For most problems, the number of intermediate values you'd have to compute in the first step would be exponential in the size of the input; there's only a tiny number of problems for which you can get away with a non-exponential number of intermediate values. But if you allow your program to run multiple steps sequentially, all of a sudden it can be much simpler. It will run slower, but it will take a lot less memory, a lot fewer resources. People who design computer circuits know this. You can design, for example, a circuit that adds two binary numbers. There's a very simple way to do it: you add the first two bits, then you propagate the carry to the second pair of bits, take the carry into account, which gives you the second bit of the result, then propagate the carry again, and so on sequentially. The problem is that this takes time proportional to the size of the numbers you're adding. So circuit designers have a way of basically pre-computing the carries, called carry look-ahead, so that the number of steps necessary to do an addition is much less than n. But that comes at the expense of a huge increase in the complexity of the circuit, the area it takes on the chip. So this exchange between time and space, or between depth and time, is well known. So what do we call deep models? A two-layer neural net, one with a single hidden layer, I don't call deep, even though it technically uses backprop, because it doesn't really learn complex representations. This is the idea of hierarchy in deep learning. SVMs are definitely not deep. Unless you learn complicated kernels, but then they're not SVMs anymore.
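As a concrete version of that parity example, here's a small sketch comparing the shallow and deep routes. The function names are made up; the 2^(N-1) count is the standard number of AND terms a two-layer DNF circuit needs for parity, while the XOR tree uses N-1 two-input gates arranged in about log2(N) layers.

```python
# Parity of N input bits, two ways.

# Shallow (DNF) route: one middle-layer AND term per odd-weight input
# pattern, so a two-layer circuit needs 2**(N-1) terms.
def parity_dnf_terms(n):
    return 2 ** (n - 1)

# Deep route: a balanced tree of two-input XOR gates, depth ~log2(N),
# using only N-1 gates in total.
def parity_xor_tree(bits):
    while len(bits) > 1:
        pairs = [a ^ b for a, b in zip(bits[0::2], bits[1::2])]
        bits = pairs + ([bits[-1]] if len(bits) % 2 else [])
    return bits[0]

print(parity_dnf_terms(16))           # 32768 middle-layer terms for 16 bits
print(parity_xor_tree([1, 0, 1, 1]))  # 1: three bits are on, so parity is odd
```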
So what are good features? What are good representations? Here's an example I like. There's something called the manifold hypothesis. If I take a picture of this room with a camera at 1,000-by-1,000-pixel resolution, that's 1 million pixels, so 3 million values with the color channels, and you can think of the image as a vector with 3 million components. Among all possible vectors with 3 million components, how many correspond to what we would call natural images? We can tell, when we see a picture, whether it's a natural image or not; we have a model in our visual system that tells us this looks like a real image, and we can tell when it doesn't. The number of combinations of pixels that are things we'd think of as natural images is a tiny, tiny, tiny subset of the set of all possible images. There are way more ways of combining random pixels into nonsensical images than there are ways of combining pixels into things that look like natural images. So the manifold hypothesis is that the set of things that look natural to us lives on a low-dimensional surface inside the high-dimensional ambient space. And here's a good example to convince ourselves of this. Imagine I take lots of pictures of a person making faces. The person is in front of a white background, her hair isn't moving, and she moves her head around and makes faces, et cetera. So I take a long video of that person, and the set of all images of that person lives on a low-dimensional surface. So the question I have for you is: what's the dimension of that surface? Order of magnitude. Any guess? [An audience member answers.] Yeah, you've probably heard this one before. For whoever hasn't heard it, you have a shot at an answer. Any guess? Don't be shy; I want multiple proposals. [An answer: linear.] Linear, what does that mean? A one-dimensional surface, OK. Any other proposal? Remember, the images are a million pixels, so the ambient space is 3 million dimensions. And the person can move her head, turn it, things like that, but not really move the whole body; you only see the face, mostly centered. [Another answer: a thousand.] A thousand, OK, why? [Answer: the surface area of the person.] The surface area of the person, right? So it's bounded by the number of pixels occupied by the person. That's for sure; that's an upper bound. But those pixels are of course not going to take all possible values, so that's a much wider bound than the real answer. Any other idea? OK, so basically the dimension of that surface, as someone said, is bounded by the number of muscles in the face of the person. The number of degrees of freedom you observe in that person is the number of independently movable muscles. There are three degrees of freedom from the fact that you can tilt your head this way, that way, or that way; that's three right there. Then there's translation: this way, that way, maybe up and down, maybe toward and away. That's six. And then the number of muscles in your face. You can smile, you can pout, you can do all kinds of stuff, and you can do it independently: you can close one eye, you can smile on one side. Whatever independent muscles you have, not counting the tongue, because there are tons of muscles in the tongue, that's about 50, maybe a little more. So regardless, it's less than 100. So locally, if you want to parameterize the surface occupied by all those pictures, to move from one picture to another, it's a surface with fewer than 100 parameters determining the position of a point on it. Of course, it's a highly nonlinear surface. It's not like this beautiful Calabi-Yau manifold here, but it is a surface nonetheless. Of course, the answer was in the slide, so, you know.
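Here's a numerical sketch of that low-dimensional-surface idea, under made-up sizes: a handful of latent "muscle" parameters pushed through a fixed random nonlinear map into a high-dimensional "pixel" space. Looking at a small neighborhood of the data, only about three directions carry any significant variance, even though the ambient space has 10,000 dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_latent, d_ambient, n = 3, 10_000, 500   # hypothetical sizes

# a fixed random nonlinear embedding of 3 latent parameters into "pixel" space
W1 = rng.normal(size=(d_latent, 64))
W2 = rng.normal(size=(64, d_ambient))
def embed(z):
    return np.tanh(z @ W1) @ W2

Z = rng.normal(scale=0.05, size=(n, d_latent))  # a small latent neighborhood
X = embed(Z)                                    # points on a 3-D surface in 10,000-D

# singular values of the centered data: only ~3 are significant,
# the rest fall off by orders of magnitude
s = np.linalg.svd(X - X.mean(0), compute_uv=False)
print(s[:6] / s[0])
```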
So what you'd like from an ideal feature extractor is that it disentangles the explanatory factors of variation of what you're observing. The different aspects of my face are not just that I move my muscles and move my head around; each of those is an independent factor of variation. I can also remove my glasses; the lighting can change; that's another set of variables. And what you'd like is a representation that individually represents each of those factors of variation. So if there is one criterion to satisfy in learning good representations, it's that: finding the independent explanatory factors of variation of the data you're looking at. And the bottom line is that nobody has any idea how to do this. Okay? But that would be the ultimate goal of representation learning. And we're basically at the end. Okay, I'll take two more questions if there are any. Yes? [Question: is there some form of preprocessing, like PCA, that would find those factors?] Okay, so the question is: is there some sort of preprocessing, like PCA, that will find those factors? Yes, PCA will find them if the manifold is linear. If you assume that the surface occupied by all those face images is a plane, then PCA, principal component analysis, will find the dimension of that plane. But no, it's not linear, unfortunately. Let me give you an example. If you take me and my oldest son, who looks like me, and you place us making the same face in the same position, the distance between our images would be relatively small, even though we're not the same person. Now if you take my face and my face shifted by 20 pixels, there's more distance between me and myself shifted than there is between me and my son, okay? So what that means is that the manifold of my face is some complicated manifold in that space, and my son's is a slightly different manifold which does not intersect mine. Yet those two manifolds are very close to each other, closer than any two samples from my manifold or any two samples from his. So PCA is basically not going to tell you anything. And here is another reason why that surface is not a plane. You're looking at me right now. Now imagine the manifold, a one-dimensional manifold, of me turning my head all the way around, 360 degrees. That manifold is topologically identical to a circle. It's not flat; it can't be a line. So PCA is not going to find it. Okay, I've got to blast off. Thanks. See you next week.