Okay, just a few words, because Sanjay actually did all the work on this, and now he has to work some more and actually give the talk. My name is still on there, but I didn't really do that much. So "deep learning" is the buzzword, and we had to satisfy everyone's wish to hear about it. But the two of us don't like to give a talk that has no content, so you will actually be overwhelmed with information about what it means, how these things work, and so on. In my opinion, and I think I'm speaking for Sanjay as well: if you just apply some program or library out there that uses neural networks and you don't really understand what it's doing, you're in a lot of trouble. It will produce output, but it won't make any sense to you. Even the researchers today don't know everything about neural networks; in fact, we know very little about why and how they actually represent information. That's ongoing research. So you have to be careful; you cannot just blindly apply this stuff. Take what you get here today as a bit of insight into how these things work, how you can approach a problem, and how you can take some first steps and familiarize yourself with the concepts. That's the most important thing: don't just jump in immediately. There are so many different implementations and techniques out there, and you have to use the right one. You cannot just force your problem onto a specific implementation and screw around with it until it works; you will get results, but they will be garbage. So it's important to listen. And now Sanjay has more than 90 minutes of fun for you.

Right. Good afternoon everyone. Thanks for coming.
I know it's Sunday and it's two hours, and I'm not really going to stop in between. So if you have questions, just interrupt me; put your hand up. Like we said, the idea is not to cover everything about neural networks; you can't reasonably do that in two hours, so you have to skip a lot of material. The focus here is on the guts of a neural network: how is it trained, how does it work, with some examples on toy datasets to understand what it actually does. What we won't cover, and hopefully will cover in future lectures, are applications: convolutional neural networks for vision, recurrent neural networks for language, and some more modern techniques. This will focus purely on the guts of neural networks and training.

Before we dive into neural networks, I want to give some context. Deep learning is a subset of machine learning, and machine learning is basically the idea that you use statistics to understand how a system operates. So this is the mental model we use. In science and engineering we study systems, and a system is basically a black box: some inputs come in, it does something according to some rules or functions, and some output comes out. You can think of the solar system and gravitation: you have some planets, something is going on, and the output is the positions and velocities of those planets. Another example is a biological cell; that's a very complex system. A third is how memory works, how your processor works, or how the network subsystem works. These are all things which are either very complex, or where we potentially don't even know what the rules are, but you can put inputs in and you can measure outputs. And so the central question is: can we find out what that function, what the rules or the laws, actually are? We have a very good method for doing this.
It's the scientific method, and it's very simple. Take some inputs, get the outputs, stare at them, and try to guess what the rules are. Use your guess to predict the outputs on new inputs and then measure the actual outputs. So you have a prediction and you have the actual value. If they agree, you might be right; you can never be sure, but you might be right. If they disagree, you are definitely wrong. That's the scientific method in a nutshell.

Of course, there are some subtleties. One: you can never prove that anything in science is correct; by proof I mean a mathematical proof, and that doesn't exist here. The Sun may not rise tomorrow. There's an overwhelming probability that it will, but who knows, maybe it will explode because we don't understand something about the Sun. The other thing: when I said compare predictions to the outputs, that's a subtle point. What is a comparison? You have numbers coming out, so usually you set tolerance limits. You say, I can predict the answer to within 20%; if I want it to within 1%, I have to do a lot more work. That's the soft nature of comparing predictions to outputs. And the last thing is more philosophical: you want your guess of the rules, the laws, to be simple; you don't want it incredibly complicated. One can argue about why, but what we've found empirically in physics and biology and chemistry is that simple models tend to be the most generalizable; they apply to the most phenomena. So the central question is: how do I guess the rules? Of course, I can guess randomly, but the probability that I'll guess the correct ones is basically zero. So can we come up with a method to systematically figure out which rules are the right ones? I have a bias towards physics.
So I put three different physicists up there: Einstein, Dirac, and Feynman, and of course there are many more. The reason I put them there is that they all had slightly distinctive styles. Einstein used to look for consistency in the models or theories he was creating. Feynman used to guess a lot; he had very good intuition, he would look at data, and he would do it his way. Dirac was super mathematical. Everyone does it a different way; the key point is, as long as you can guess the right rules, you're good. And one way to do this is to actually look at data. You say: look, I don't want to do all the mathematics, I don't have intuition; let me just collect a lot of inputs and a lot of outputs, stare at them or scan them, and see if I can figure out exactly what's going on, how the outputs get computed. This was done, for example, by Kepler, who observed the motion of the planets and came up with his three planetary laws purely from data.

So why are we talking about all this, the scientific method and systems? Because that's what machine learning does. Machine learning collects a lot of data, and you have algorithms that systematically scan that data, try different rules or guesses, and converge towards the rules that best explain your data. So the central problem moves from "let me try and guess the rules" to "let me come up with efficient algorithms to try many, many combinations, systematically scan this whole space, and find the right set of rules, or the right candidates". And for these algorithms: traditionally in complexity theory you look at time performance and space performance; in machine learning you add a third component, which is how much data you need. Do I need n-squared data to get a certain amount of precision, or do I need log-n data?
And that becomes an important third question. Now, these are terms that everyone in machine learning throws around all the time, so I just want to define them, and then they are completely demystified. What is a model? A model is your guess for the rules; it can be right, it can be wrong. What is learning? No one really thinks this is how human beings learn; learning in this case refers to the training process, the systematic scanning of the rules.

Then there are roughly three categories in machine learning: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning is where you get continuous feedback: you send an input to your system, you get an output, you look at your model, you update it, and you keep repeating. Unsupervised learning is where you get no feedback: you just have some data. It can be just the outputs; I dump a bunch of outputs and don't tell you what the inputs are, and those techniques are used to learn the structure of what's actually going on. They're weaker, and unsupervised learning is a much harder problem. And there's an intermediate case, which is reinforcement learning, where you get infrequent feedback. The way it's generally stated is that you have an agent in an environment, so think of a robot in this room, and that agent is trying to maximize some reward over time. Maybe there's some very good chocolate in that corner and some very cheap and bad chocolate here, and the agent can eat the chocolate here, or it can spend time and effort to find the one that maximizes pleasure. That whole setup is reinforcement learning. The reason I state these here is that deep learning is used in all three, but it's not the only solution.
There are classical techniques too, but there's a lot of progress in applying deep learning to reinforcement learning now.

Okay, so that's machine learning in a nutshell: how do we guess the rules? Now, when you do machine learning, in deep learning and other techniques, you often see probabilities, and the question is why. Why is probability showing up here? Why does it show up in statistics? So again, let's go back to our system. You have n inputs, you have an output, and we are trying to guess the rules as usual. Now there can be several problems. n might be huge: think of financial markets. You want to predict Red Hat's price tomorrow. Red Hat's price gets affected by a whole sequence of events around the world, some small, some large, and you can't possibly list every single thing that will affect someone's stock price; there are too many factors. Oil prices might go up because something happens in Saudi Arabia, and that affects Walmart; who knows? So you can't list all the inputs. Even if you could list all of them, some might have very tiny effects, so you might not think they're important and drop them. Or you have multiple inputs which are highly correlated but not perfectly correlated, so each input has small independent components, and you might miss those. The core point, in a nutshell, is that most systems have too many inputs to list and to measure. So what you are actually doing is this: you have n inputs x1 through xn, you have the output y (this is just different notation: those are the inputs, that's the output), and you have this function that you're trying to guess. But you don't know all the inputs. So what you actually do looks like this: you have three inputs, one output, and you're trying to build an approximation to the rules. Of course, you don't have all the input data.
So you can only get an approximation, and you end up in a situation like this. Imagine a box with a dashboard: there's a screen that prints a number, and you have three knobs, x1, x2, x3, and those are the only three knobs; that's the only thing that affects the box. So you go and set x1 to 3, x2 to 2, x3 to 5, press a red button, and it pops out 3. You come back tomorrow and do the same thing again: you set 3, 2, 5 again, press the button, and you get 3 again. There are some rules; it's doing some computation, and it always prints the same answer. So that's a deterministic system, and that makes us happy: I can figure out what's going on.

But now imagine you have a different box. It still has three knobs, but I cheated: I didn't tell you that there are seven other knobs on the other side that you can't see, and those knobs are loose; they keep moving around. So you say, okay, I'll play the same game: fix x1, x2, x3 on the knobs, press the red button, and you get the answer 3. You go home, you come back tomorrow, you repeat the same thing, 3, 2, 5, press the button, and now you get 4. What's going on? I got 3 yesterday; why am I getting 4? So you come back the next day, try again, and this time you get 3 again. You do this for ten days, and on six days you get 3, on three days you get 4, and on one day you get 1. And you say: what's going on? My inputs are completely fixed, I'm changing nothing, but I get different answers. The answer to that puzzle is: well, you had no control over the seven other knobs, and they're changing. So the system actually sees its inputs change.
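The two boxes can be sketched in a few lines of code. Everything here is invented for illustration: the rule inside the box (a sum taken modulo 7) and the behavior of the seven loose hidden knobs are just one possible way the story could play out, chosen so the visible box prints 3 for the inputs 3, 2, 5.

```python
import random

def visible_box(x1, x2, x3):
    # Deterministic box: output depends only on the three visible knobs.
    # (The rule itself is made up; any fixed rule would do.)
    return (x1 + x2 + x3) % 7

def hidden_box(x1, x2, x3, rng):
    # Same visible knobs, but seven loose hidden knobs jitter on every press.
    hidden = sum(rng.choice([0, 1]) for _ in range(7))
    return (x1 + x2 + x3 + hidden) % 7

print(visible_box(3, 2, 5))   # always 3, every day
rng = random.Random(0)
presses = [hidden_box(3, 2, 5, rng) for _ in range(10)]
print(presses)                # same visible inputs, several different answers
```

The visible inputs never change, yet the second box produces a distribution of answers, which is exactly why incomplete knowledge of the inputs forces you into probabilities.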
Three of its inputs are always fixed, but the other seven are doing something else, and because the inputs are changing, you get different answers; you just can't see those seven inputs. So that means if you have incomplete knowledge about a system, if you don't know all the inputs, you always end up with probabilities: you get some variance in the results. So in your mind, if you see any system and you say "look, I can't completely describe this", you will end up doing something probabilistic. That's why almost every machine learning technique has probability built in, or there are ways to add probability on top, to always give you estimates and uncertainties. And this is good. If this were not the case, if we didn't know how to work with probabilities, imagine how many systems you could study: almost none. Almost everything has too many factors. Suppose you're trying to predict the memory usage of some process, and someone said: no, I need to know every single thing, I need to know the architecture of the processor, exactly what type of silicon, what metal was used in the wires, all that stuff. You'd never get anywhere. Instead you say: I can't control all of that; that adds some probability, some incomplete knowledge, but I can still work with this.

So finally, why you're here. That's machine learning and probability in a nutshell; now, what's a neural network? We said earlier that we need to come up with algorithms that can efficiently scan the guesses for the rules, and there are many techniques; you might have heard of random forests, decision trees, k-means clustering, gradient-boosted trees, all kinds of fancy names. Neural networks are just one more of these: another way of efficiently scanning your search space to try and figure out the rules. So why focus on these?
Why not the other ones? Because they have had spectacular success in computer vision tasks in the last seven or eight years, in language tasks such as automatic translation, in speech-to-text, and, like I said, even in reinforcement learning: you can take Atari games and train an agent to play those games very well. Of course, there are limitations too, and this is an area of active research. The key thing is that they're very flexible learners; in other words, they can learn many kinds of rules. You can give them many different systems, and they will figure out what the rules are. And, though these are slightly less important points if your interest is application-oriented, there are many open research problems, both on a mathematical level and on an applied level. We won't talk about this in this lecture, but there has been new progress in something called generative adversarial networks; basically, you can learn the correlational structure of high-dimensional data very well. People have also been using these for discrete optimization, combinatorial optimization.
That's generally a bad idea, because we have very efficient classical techniques for that, but it's still an interesting research problem: can a neural network actually do high-dimensional optimization, like travelling-salesman problems? And of course, as you know, there are many other applications: visualization of high-dimensional data, many other things.

So what is a neural network? The "neural" part gives you a hint: it comes from neuroscience. The idea is that our brain has these cells called neurons. I know nothing about biology, but it looks like that, so I think of the myelin sheath as insulation, for example; I have electrical analogies for all of this, and that will make a neuroscientist cringe. We really don't go deep into the neuroscience; most people who work in this field don't know much about it. So you abstract this out. You say a neuron is a circle which takes an input and gives me an output, and by input and output I mean it takes a number and gives me a number. More precisely, it takes a number that comes in and compares it to a threshold that's tied to the neuron. If the number is higher than the threshold, it outputs a one; if the number is less than the threshold, it outputs a zero. So it's a switch: it has a threshold; if you give me something higher, I'll give you one; if you give me something lower, I'll give you a zero. Of course, that's too restrictive; you don't want just one input. So instead you say: what if I have n inputs, i1 through in? If you give me n inputs and I still want one number to come out, what do I do? Well, I already tied a threshold to the neuron (sometimes it's called a neuron, sometimes a node; if I say either one, I mean the same thing). So I already have this threshold tied to it. Why don't I also put weights on all these incoming edges? Weights are numbers.
So these are the w's. You take each input and multiply it by a weight, so you have w1 times i1; do that for each one and add it all up. It's a weighted sum, a linear combination. And instead of comparing the original input, you compare this whole number that you just generated to the threshold. So I take this weighted sum of all the inputs and compare it to my threshold: if it's higher, I give you a one; if it's lower, I give you a zero. And then you just put these in a network: connect a bunch of them. The structure here is: you have four inputs. Imagine just one of these nodes; all it sees is four incoming lines, and it has weights on those. It takes each input, multiplies it by a weight, adds it up, and compares it to the threshold that it is tied to. The second neuron does the same thing, and the third one does the same thing. Then the last ones, the four outputs, also do the same thing, but instead of looking at the inputs, they look at what these three neurons send them. So each neuron looks back and asks "who are my inputs?", takes a weighted sum, compares it to its threshold, and sends its output out. That's basically a neural network.

The data flows from left to right: the inputs come in from the left, it does all this stuff, and it sends outputs out. This is often called the feed-forward architecture, because you're feeding data forward. It's organized in these boxes, these layers: each vertical line is called a layer. This is the input layer, because all the input arrives there; this is often called the hidden layer; and this is called the output layer. In deep learning, the "deep" comes from many hidden layers: modern neural networks have tens or hundreds of hidden layers nowadays. So you look at this and you say: this is some arbitrary game. You just invented some neuron, you put them in a network, and you tell me to throw data in and see what happens. Why is this useful? Why can't I just go and do regression?
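Before answering that, the thresholded neuron just described fits in a few lines. The specific weights and threshold below are made up purely to exercise it; with these particular numbers the neuron happens to behave like a logical AND of two binary inputs.

```python
def neuron(inputs, weights, threshold):
    """Binary-switch neuron: weighted sum of the inputs, compared to a threshold."""
    s = sum(w * i for w, i in zip(weights, inputs))
    return 1 if s > threshold else 0

# Hypothetical numbers: two inputs, weights 0.6 each, threshold 1.0.
print(neuron([1, 1], [0.6, 0.6], 1.0))  # weighted sum 1.2 > 1.0, outputs 1
print(neuron([1, 0], [0.6, 0.6], 1.0))  # weighted sum 0.6 < 1.0, outputs 0
print(neuron([0, 0], [0.6, 0.6], 1.0))  # weighted sum 0.0 < 1.0, outputs 0
```

A network is then just many of these wired together, with each neuron reading the outputs of the previous layer.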
Why can't I do random forests? The answer to that is something called the universal approximation theorem. What it loosely says is: given any reasonable function f (we'll define "reasonable" more precisely; remember, f is eventually our guess for the rules, the laws that the system obeys, but in this case it's some mathematical function), I can approximate it to arbitrarily high accuracy using a neural network. Every machine learning algorithm has its restrictions; it can't learn every function. Neural networks are very flexible, so they can learn almost anything, with as high an accuracy as you want.

Again, this is not meant to be notational overload; it's just for reference. What it says is: take a function f, the one at the bottom, the one I'm trying to learn, and assume for simplicity that instead of taking any real input and mapping it to a real output, it takes n numbers between zero and one. That's what this notation means: [0, 1] means the interval between zero and one, inclusive, and raising it to the n means you have n such numbers. So I have n inputs, each one between zero and one, and the output is a real number; that's what the R means, R stands for the real numbers. The only assumption I'm going to make about this function is that it's continuous. Again, mathematicians would cringe at this, but think of a function that I can draw on a piece of paper without lifting my pen. As simple as that: I shouldn't have to lift my pen; it's all nice and continuous. The other assumption we'll make is that I'm given some other function, sigma, which is fixed, and there are really almost no assumptions on it, except that it's nice and continuous; it's bounded, which means it can only give me values between two bounds, never below the lower one or above the upper one; and it's not constant. Constant functions are boring.
I can do nothing with constants. So, two ingredients: the function I'm trying to approximate, and this special function sigma, which is fixed, as long as it's continuous, not constant, and bounded between two values. With those I can do something pretty remarkable. This was actually a major result, in the late 1980s, and it has been improved over time. The result is basically: given these two ingredients, I can find the weights on the edges of the neural network. The w's are the weights that go from the input layer to the hidden layer; the u's are the weights that go from all the neurons in the hidden layer to the output layer; the N is the number of nodes I put in the hidden layer; and the b is basically a proxy for the threshold. Remember, we said every node has a threshold. Technically b is called a bias, and it's the negative of the threshold, but basically it plays the role of a threshold. So I can find these weights, thresholds, and an appropriate number of neurons so that this absolute value of the error (the value of the function I'm trying to learn, minus what is basically the output of the neural network) can be made smaller than epsilon, where epsilon is some number you choose. It's saying: given a reasonable function f, I can use my neural network to make the error between the output of the network and the function as small as I want. Which is good news: this means I can construct neural networks and learn any function I want. And that's what we want to do; the function we want to approximate is the rules of the system.

Now, there are two disclaimers here. The first is that the activation, the sigma, the bounded non-constant function, can't be what's called a linear function. It can't just be a constant, and it can't just be linear; we'll see what linear means. The other is that nothing is free.
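For reference, the inequality being described here has a standard compact form, written with the speaker's names for the symbols (w's and b's for the hidden layer, u's for the output weights, N hidden nodes):

```latex
% F : [0,1]^n \to \mathbb{R} continuous; \sigma continuous, bounded, non-constant.
% Then for every \varepsilon > 0 there exist N \in \mathbb{N} and
% u_j, b_j \in \mathbb{R},\; w_j \in \mathbb{R}^n,\; j = 1, \dots, N, such that
\left|\, F(x) \;-\; \sum_{j=1}^{N} u_j \,\sigma\!\left( w_j^{\top} x + b_j \right) \right| \;<\; \varepsilon
\qquad \text{for all } x \in [0,1]^n .
```

The sum inside the absolute value is exactly the output of a one-hidden-layer network with N nodes, so the inequality says that network tracks F everywhere on the input cube to within epsilon.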
So I said we can approximate that function to arbitrarily high accuracy, but there's a price you pay: if you want to make that accuracy even better, that epsilon even smaller, you may have to add exponentially more nodes in the hidden layer. If you want an accuracy of one, maybe I need a hundred nodes; if you want an accuracy of 0.1, you might need a thousand nodes, or ten thousand. It can grow exponentially, and that of course has computational cost, storage cost, learning cost: it takes more time to learn.

So, summarizing what our neuron looks like now: it still has n inputs, it still has weights, and you still take a weighted sum. But instead of the thresholding we did before, comparing to a threshold and outputting zeros and ones, you feed the sum to this nonlinear activation function, the sigma. So you take the weighted sum, you feed it to sigma, it gives you some number, and that's the output. Now, I said the activation function cannot be linear. What does linear mean? In one dimension it's a straight line, a function of the form c1·x + c2; if you plot it, it's a straight line. In two dimensions it's a plane, in three it's a hyperplane, and so on. And the reason you don't want a linear function is this technicality: if you give me a number and I apply a linear function to it, then another linear function, maybe a third linear function, it has no effect.
I just keep getting a linear function as the output. More formally, if I'm given two functions f and g which are both linear, and I compute their composition (this basically means take x, apply the function f to it, and then apply the function g to the result), what I get is something of the form e1·x + e2, where e1 and e2 are constants. So there's really no point in stacking these linear functions one after the other; you might as well use one. So you want something that's nonlinear, something that isn't that simple a function. And you've seen nonlinear functions all the time: square roots, logs, exponentials, powers, monomials; x raised to the 55 is not linear. The abstract reason why you want nonlinear functions is that almost every system we work with is nonlinear. It's not straightforward; it's doing the square root of something, the log of something, something more complex.

So now some notation, because if you do any work in deep learning you'll see this everywhere. It's linear algebra; it just shows up everywhere. Often people who are trying to get into deep learning or machine learning get stuck because they see unfamiliar notation and think: I don't know what this is, this sounds too fancy. But it's not fancy; it's just notation.
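The point about composing linear functions can be checked directly. The coefficients below are made up; the claim being demonstrated is that g(f(x)) collapses to a single linear function e1·x + e2, so stacking linear layers buys you nothing.

```python
def linear(c1, c2):
    """Return the linear function x -> c1*x + c2."""
    return lambda x: c1 * x + c2

f = linear(2.0, 1.0)   # f(x) = 2x + 1
g = linear(3.0, -4.0)  # g(x) = 3x - 4

# g(f(x)) = 3*(2x + 1) - 4 = 6x - 1, which is again linear: e1 = 6, e2 = -1.
h = linear(6.0, -1.0)

for x in (-2.0, 0.0, 5.0):
    print(g(f(x)), h(x))  # identical on every input
```

With a nonlinear activation between the layers, this collapse no longer happens, which is exactly why the theorem demands a non-linear sigma.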
This sounds too fancy, but it's not it's just notation So instead of writing the the linear sum there W1 w2 all this stuff you often write it in the shorthand and what w is it's a vector What I is it's a vector w is the weights vector, which looks like this w1 through wn I is the input vector which looks like I one through I n just put them You know in a Python list basically a NumPy array and you do a dot product between them Which is what this is the t stands for transpose Transpose takes a column turns it into a row or takes a row turns it into a column So what this says is it says take my weights vector turn it into a row Take my input vector, which is a column Take the dot product right if you use NumPy NP dot dot and You add the the bias term to it again bias is a proxy for threshold, right? It's a constant term you add that to it and That's my output you feed that to a nonlinear activation and that's my output And so now you know the natural question is sure I want some activation function Which one should I pick how should I pick one? Do I just like make one up? So what constraints do we have it has to be nonlinear it should ideally be simple to evaluate Computationally don't want something very complex It of course has to be it actually technically you can relax the fact that it's Bounded it doesn't have to be bounded. It has to be nice and continuous And you will see later that we need derivatives of this function So the derivatives are hopefully easy to evaluate the first derivative And so again going back to what we said initially we were doing a binary switch, right? 
We were saying: take this weighted sum plus the bias; if it's less than zero, output zero, and if it's more than zero, output one. (The bias term basically means that instead of comparing to a threshold, you can compare to zero.) That's what the shape looks like. But it's not very useful, because the derivative is zero everywhere except at one point, and if the derivative is zero, as we'll see, we lose the information we need; without it I really can't do much, I can't train a neural network.

All right, maybe I can cheat a bit. Maybe I can make it smooth, nice and continuous, so it looks like this S-shaped thing. It's called the sigmoid function, which is also used in logistic regression, for example. You can think of the sigmoid as just a continuous, nice, smooth version of the binary switch: it is zero as you go far to the left, it is one as you go far to the right, but in the middle it interpolates smoothly between zero and one. And this function has a nice property: if you compute the first derivative (listed here; the prime means the first derivative), you can rewrite the first derivative in terms of the function itself. That's generally not true. Imagine you are evaluating this function billions of times; then you don't want to evaluate two separate functions. You can just evaluate sigma(x) and reuse that value to compute the first derivative. So it has very nice properties, and it was used in basically every single neural network until maybe three or four years ago. Now there are better choices.

So what choices are there? One is something that looks almost the same; it's called the hyperbolic tangent, but it varies between minus one and one. Then there are all kinds of fancy names. This one is called a ReLU, a rectified linear unit. It's simple, right? If x is negative,
it gives me zero; if x is positive, it gives me x back. This is used a lot now because it's very easy to evaluate: you're doing almost nothing, just comparing to zero and returning the value if it's greater than zero. And then there are modifications of that. (None of this is very rigorous.) Someone sat down and said: this is interesting, but when x is zero or less than zero I always get zero back. Maybe I shouldn't get zero back; maybe it should give me something back with a different slope. So you use two lines: you just change the slope of the left part. Then someone else came and said: yes, but that's not continuous in its derivative at zero; I want something smooth. I forget what this one is called, the exponential linear unit or something like that; it's a smooth version of the same idea. And there's no deep science here. These are tricks, engineering: people look at what the training process is doing and they say, maybe I can tweak this, maybe I can tweak that, and that's okay. As far as we know there's nothing very deep here. It's a choice you make, dictated by the kind of problem you're solving.
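The activations just listed, and the sigmoid's reuse trick for its derivative, fit in a short sketch. The leaky-ReLU slope of 0.01 and the ELU alpha of 1.0 are conventional defaults, not anything mandated by the lecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Zero on the left, identity on the right.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Same as ReLU, but a small nonzero slope on the left instead of flat zero.
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth at zero: alpha * (exp(x) - 1) on the left.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

# The sigmoid's reuse trick: sigma'(x) = sigma(x) * (1 - sigma(x)).
x = np.linspace(-5.0, 5.0, 11)
s = sigmoid(x)
analytic = s * (1.0 - s)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
print(np.max(np.abs(analytic - numeric)))  # agreement to high precision

print(relu(np.array([-2.0, 0.0, 3.0])))        # -2 -> 0, 0 -> 0, 3 -> 3
print(leaky_relu(np.array([-2.0, 0.0, 3.0])))  # -2 -> -0.02
```

The derivative check is the practical payoff: once sigma(x) is computed in the forward pass, the derivative needed later in training comes almost for free.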
Sometimes it's dictated by computational efficiency, and you often get weird behavior during training, which we mostly won't go into; well, I'll tell you one: many of the nodes in your neural network die. Their weights become such that they always output zero, and you can't do anything with that, so then people choose a different activation function.

So, to summarize: we started with a neuron and generalized it. It takes n inputs, takes a weighted sum, feeds it to this nonlinear activation function (it can be the sigmoid, the hyperbolic tangent, something else), and outputs a number. That's a neuron. If I put many of them together in a network, that's a neural network, and we can approximate any function to arbitrarily high accuracy with this weird method we created. This again is some notation, but it's very useful. The inputs are i1 through i4, and now we have a1, a2, a3, which are the values at the three intermediate neurons. As before, they are the weighted sum plus the bias, fed to an activation function, in all three; the only difference is that each node has its own weights and its own bias, which is why you have these subscripts, one, two, three. You get three numbers, a1, a2, a3, and then you say: okay, what next? I have an output which takes the values a1, a2, a3, takes a weighted sum, adds a bias, and returns that to me. Yes, please?

[Question from the audience.] So the question is: doesn't the network always look like a tree? That's not the case: there are other architectures, like recurrent neural networks, which have loops, though other training issues crop up there. For sequential data, time series, language, what you use nowadays are recurrent neural networks. But for almost every vision task, for example recognizing objects, or what's used in self-driving cars, for example?
is basically this architecture with some modifications. And so, one more layer of abstraction in notation: we use vectors, and the weights can be collected in a matrix. This is the shortest way to actually work with a neural network. You have your input, which is a vector, and you multiply it by a matrix. An a-by-b matrix — a rows, b columns — takes something with b dimensions and maps it to something with a dimensions. So you should think of a matrix as something that takes a vector and maps it, linearly, to another vector of possibly different dimension. If I want to send a three-dimensional vector to a ten-dimensional vector, I multiply it by a ten-by-three matrix. And so what this says is: give me an input which is, let's say, four-dimensional. Multiply it by a matrix W that takes that input to the hidden layer, which might have five dimensions — five nodes. Add a bias to every element of the vector, then apply an activation function. Good: now I'm sitting in the hidden layer, where the activation function has been applied, and I have a vector. Multiply that by another matrix V that takes me to the output. I might have one output, so it takes me to a number; I might have 15,000 outputs, and it takes me to a 15,000-element vector. And add another bias to the whole vector — and by bias I don't mean that a constant c is added to every element; c is a vector, too. So you can imagine implementing this in, I don't know, numpy: it's one line, very simple. Of course, the trick is that I don't know what V and W and b and c are, so a lot of the work we'll do is about how to find them efficiently. So think of a neural network — at least in this feed-forward architecture — as: multiply by a matrix, add a bias vector, apply a non-linear function, repeat. So fine, I have this network.
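The "one line of numpy" claim is easy to make concrete. This is a minimal sketch; the sizes (4 inputs, 5 hidden nodes, 1 output) and the random initialization are just for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 inputs, 5 hidden nodes, 1 output.
W = rng.normal(size=(5, 4))   # maps a 4-vector to the 5-dimensional hidden layer
b = rng.normal(size=5)        # hidden-layer bias (a vector, one entry per node)
V = rng.normal(size=(1, 5))   # maps the hidden layer to the output
c = rng.normal(size=1)        # output bias (also a vector)

def forward(i):
    # "Multiply by a matrix, add a bias, apply a non-linearity, repeat."
    return V @ sigmoid(W @ i + b) + c

out = forward(np.ones(4))
print(out.shape)  # (1,)
```

Finding good values for W, b, V, and c instead of random ones is the whole training problem.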
What do I do with it now? This is all very theoretical, so here's a practical example of how you would really do something. You get a bunch of images — a bunch of cute animals — and each one has a name, so you have labels. You know this image is an elephant, this image is a dog, this image is a cat. You feed these to your neural network. As a side note, one of the really powerful things about neural networks is this: classically, if I gave you an image and asked you to tell me algorithmically — not as a human — whether it's a dog or a cat, you would have to do very specialized things. Can I pick out ears? Can I pick out shapes? Can I write these special algorithms by hand to pick out all these features in a robust way? Neural networks don't need that. You give them pixel-level information — whether it's a 10-by-10 image or a 10,000-by-10,000 image, you give the raw pixel values — and they figure things out. So you don't really have to do any preprocessing, except for minor steps like scaling; just feed the thing in and it figures out what to do. So you take the images and feed them to your neural network at the pixel-by-pixel level. You do so-called forward propagation: you make a prediction using your network. Again, same as before: i holds all the pixels, so it can be a huge vector. W is for now an unknown weight matrix, so you randomly initialize it; add the bias, randomly initialized; feed that to the activation function; multiply by another weight matrix; add another bias. All the weights and biases are randomly initialized — we don't know what the values should be, we just do something. And you get an output. The output might be: what's the probability that this image is a cat? That's an example of an output. If you're doing a regression problem — predicting house prices — it might say the price of this person's house is $10,000. So what do I do with the output?
Well, I have to compare it to something. I know what the answer should be — I know this image is a cat, because you gave me labeled data — so I compare it to that. So the output ŷ is always what we predict using the neural network, and the actual value y is the label: cat, dog, and so on. And I need a way to compare these two. So there's something called a loss function — it's called the cost, or the loss — and in this case it's a very simple one. It says: you're predicting a number, and I know what number I should get, so let me subtract them and square the difference. If they agree, I get zero, so I have no cost, no loss. If they disagree, the more they disagree, the more this number grows. If you had to predict 10,000 and you predicted a billion, of course, that's a huge number: 10,000 minus a billion, squared. But if you had to predict 10,000 and you predicted 10,001, great — that's very accurate, and this gives me one, or a half in this case. And so what I want to do is take all my data, feed it to my neural network, make predictions, and use my loss function to compare my predictions to the actual outputs I expect. Then I want to tweak the weights and biases — the W's and the V's and the b's and the c's — because we initialized them randomly, and that's of course not going to work. Can I come up with a way to change them in a smarter way, so that this loss function goes down? If I can minimize this loss function, great: I make accurate predictions. So in a nutshell, that's the game: find the W's, V's, b's, and c's that minimize this loss function. And that process is training. So all this stuff about GPUs, distributed computing — all that effort goes into finding these parameters. That's it. Once you find them, you really don't need much. So this is the whole game in a nutshell: you have a bunny, you know the value is "bunny", you make a prediction — and what does your neural network say? It says it's a polar bear.
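The squared-error cost from the house-price example is simple enough to write down directly. A minimal sketch, with the conventional factor of one half that makes "10,001 versus 10,000" come out to a half:

```python
def squared_error(y_pred, y_true):
    # The 1/2 is a common convention; it cancels a factor of 2 when differentiating.
    return 0.5 * (y_pred - y_true) ** 2

print(squared_error(10_001, 10_000))  # 0.5 -- off by one, tiny loss
print(squared_error(10_000, 10_000))  # 0.0 -- perfect prediction, no loss
```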
So you say: okay, my loss function doesn't agree; let me change the values and see if I can make the output be "bunny". So we need loss functions. Any questions so far? I know this is dense stuff, at least some of it, but the takeaway at this stage should be: I have this thing that someone invented; it's basically repeating the same calculation again and again; it has some parameters — the weights and the biases — that I don't know; and I want to find the weights and biases that minimize the mismatch between my predictions and the actual outputs. So the question is: how do I measure the difference between the actual output and my prediction? Often, the way that's done is to treat each example — each piece of input data — independently. That's what this sum means: the sigma is a sum over every single data point you have, i equals 1 to n, and for each piece of data you compute a difference between the prediction — ŷ, with that funny hat — and the actual value y, the actual label. You compute a cost, or loss, for each — some number that's greater than zero most of the time — and add it all up. I want to bring that down. So how do I choose that c? I've reduced the problem a bit: instead of trying to come up with a loss function for all the data at once, I only have to do it for one example. Great. But how do I actually choose what that function c should be? The simplest one you can pick is mean squared error, what we saw before: prediction minus the actual value, squared. Why do you square it?
Because you want it bounded below by zero. The minimum you should hit is zero, which is an exact prediction, and any deviation from that should only push the loss up. So you put in the square. You could use absolute values instead, but then it's not differentiable at zero, all that stuff — so this works. That works very well for regression problems. Regression is predicting a number: predict house prices, predict a stock value, predict memory usage — it works extremely well for that. But what happens if my problem is different? Say it's binary classification, which basically means I'm predicting one of two categories. You give me images and I tell you whether each is a cat or a dog. Or — this is a random example — given the state of the world, I run a military base: should I fire a missile or should I not? Is my email spam or not spam? Binary decisions like that are called binary classification problems. So what you're predicting is class zero or class one — or, what you often predict is a probability. You say the probability that this email is spam is 0.9; the probability that that email is spam is 0.1. And then you put a threshold. You say: fifty percent. If the predicted probability of something is more than fifty percent, I'll say it's in class one; if it's less than fifty percent, I'll say it's class zero. And one way to measure how well you're doing is to look at accuracy. I make predictions of probabilities, I put a threshold at 0.5 — anything more than that is one, anything less is zero — and I just count what percentage of my predictions are correct. You gave me ten thousand emails; I said three thousand are spam. How many of those did I get right, and how many of the non-spam emails did I get right? Maybe I made seventy percent correct predictions. Can I use that to measure my loss, my cost?
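Accuracy as just described is easy to sketch; the probabilities and labels below are made up for illustration. Notice that the threshold is a parameter of the computation:

```python
import numpy as np

def accuracy(probs, labels, threshold=0.5):
    # Harden probabilities into class predictions, then count the matches.
    preds = (np.asarray(probs) >= threshold).astype(int)
    return float((preds == np.asarray(labels)).mean())

probs  = [0.9, 0.1, 0.6, 0.4]
labels = [1,   0,   0,   1]
print(accuracy(probs, labels))       # 0.5 at the 50% threshold
print(accuracy(probs, labels, 0.3))  # 0.75 -- different threshold, different answer
```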
Of course, in this case you don't want it to be zero, you want it to be a hundred, but it's a similar idea. And the answer is: you really don't want to do that. One reason: your accuracy is a function of your threshold. If you pick fifty percent you'll get one answer; if I pick forty-five percent I'll get a different answer. And there's no real reason why you should pick fifty percent. We like fifty percent — it's midway — but there's really no reason that should be the number. Maybe it's ten percent; who knows. So one problem is this dependence on the threshold, which you don't want; it just adds complications. Two: accuracy is very discrete. If I change the underlying weights and biases of my neural network, it's very hard to know how the accuracy will change, because the weights change the probabilities in a continuous way, but even if I change the probability of something from 0.40 to 0.41, the accuracy doesn't change. It's too indirect. So you don't use accuracy at all. You say: okay, maybe what I should do is this. The probability is between zero and one; my class is class zero or class one. Let me just subtract and square them — mean squared error. y is either zero or one, and p is a number between zero and one — these are probabilities — so let me just subtract and square them. That should work, and it actually does. It sounds weird, because you're comparing a discrete object — zero or one — to a continuous probability between zero and one, which doesn't really make sense, but sure, it works in practice. The problem with this shows up when you get a prediction wrong. If you get it right, you of course get zero: the class is one, y is one, and you predict probability one — perfect, one minus one is zero. But what if you predict the wrong thing?
What if an email is spam, so you should be predicting one, but you predict zero — you got it completely wrong, you said it's not spam? In that case this gives you one minus zero, squared, which is one. So the worst this can get is one, and that's not harsh enough. You want the teacher to be extremely strict, and this is not strict enough for the neural network: it doesn't impose a heavy penalty when it misclassifies something. So then you go to mathematics — you go to probability — and you ask: what's the likelihood, or in this case the probability, that when my neural network predicts p, I get class y? I'll go through this a bit quickly, because there are too many details, but the basic idea is that you can write the probability as p raised to y, times one minus p raised to one minus y. If your y is zero — if you're in class zero — then the first term just dies: anything raised to zero is one. So I get one minus p. Yes: the probability of belonging to class zero is one minus p. If y is one, then the second term dies — its exponent is one minus one, which is zero — so I just get p. Of course that makes sense: when y is one I get p, and when y is zero I get one minus p. So this is a compact way of writing the probability of getting the value y when the prediction of the neural network is p. Now, likelihoods — probabilities — multiply: if you have independent samples, you can multiply the probabilities. And so you say: okay, I have this huge data set.
Let me multiply all these likelihoods together. Each example has a label yᵢ and a prediction pᵢ; I compute this quantity for each one and multiply them together, and I want to maximize the result. This technique is called maximum likelihood. It's used everywhere in statistics and machine learning: build a model that maximizes the probability of predicting the right thing. So you want to maximize this, and as soon as you get to this point, you can declare victory, because minimizing and maximizing are essentially the same thing — tack on a minus sign. If I get to the point where I need to maximize something, I can turn that into a minimization problem. Except you do a few tricks first. You take the log, because logs convert products into sums, so now we get something that looks like a sum over the cost of each example — which is what we wanted. The log is a monotonic function, so maximizing something is equivalent to maximizing its log. Can I turn that into a minimization problem? Yes: put on a minus sign. Maximizing a function f is the same as minimizing the function minus f. So you say, all right, I don't need to worry about this — just put a minus sign in front, which is right here. And you get this weird-looking thing, which we'll try to understand, and you say: I'm going to minimize that thing. The y's are fixed — they're the labels you gave me, zero or one, so I can't do anything about them — but the p's are the probabilities I'm predicting, and I want them to be the right ones. When you get something messy like this, it looks scary, but you just take it apart, break it down, and you'll often see that when you stare at it for long enough it becomes very intuitive — almost obvious — even though when you first see it you say, who came up with this? So let's see what happens.
Let's say my label is one: the image is a cat, and that's class one by definition for us. If I put y equals one, the second term completely dies out — one minus one gives me zero — so the term just drops away. That's a good sign: if y is one, I just need to worry about this thing, which is minus log p. So if y is one, the loss is minus the log of the predicted probability. If y is zero, the first term dies out; the second one gets one minus zero, which is one, so I just get minus log of one minus pᵢ. Okay, that's simpler: it's just the log of the probability, or the log of one minus the probability. So what happens if y is one and I predict that correctly — y is one and pᵢ is one? In that case I get minus log of one; the log of one is zero, so I get zero. Good: if I predict the right thing, my loss is zero, as it should be. What if I predict completely the wrong thing — y is still one, but pᵢ is zero? In that case I get minus log of zero, which blows up. It's infinite: the log of a number less than one is negative, and as you approach zero it goes off to infinity. So this is much stronger: when I make the wrong prediction, my loss blows up. What happens in the other case, when y is zero? Basically the same thing: when you predict the right thing, your loss is zero; when you predict the wrong thing, your loss blows up. So the key thing here is that this is motivated by probabilities and likelihoods, so it's meaningful; and it gives you zero when your predictions are right, something infinite when your predictions are completely wrong, and it smoothly goes from zero to infinity in between. And the meta point I want to make is: if you are interested in deep learning or machine learning and you look at things like this, never get intimidated. This is not Fields Medal material.
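The case analysis above is exactly the binary cross-entropy loss. A minimal sketch — the small `eps` clipping is a standard numerical guard against taking log of exactly zero, not part of the math:

```python
import math

def binary_cross_entropy(p, y, eps=1e-12):
    # -(y*log(p) + (1-y)*log(1-p)); eps keeps the log away from 0 and 1.
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(binary_cross_entropy(1.0, 1))  # ~0: confident and right
print(binary_cross_entropy(0.5, 1))  # ~0.69: on the fence
print(binary_cross_entropy(0.0, 1))  # ~27.6: confident and wrong -- the loss blows up
```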
Just take it, decompose it, look at each term. It takes some time — it's like reading a very complex function — but you'll figure it out. Same thing here: it looks messy, but you see the same structure. This is the case when you're predicting not two classes but a thousand, or ten thousand; it's called multi-class classification. It's the same thing: it basically looks at the logs of the probabilities you're predicting. And really, you can write your own loss function. Not that you can write anything — it has to have these simple properties: it should ideally be bounded below by zero, and it should increase smoothly the worse your predictions get — but people do design their own loss functions based on the problems they work with. Another one is the Kullback-Leibler divergence, which compares two probability distributions; that's used for visualization problems. So it really depends, but if you're entering deep learning and trying things out, what you'll mostly end up using are the commonly used ones. So all that is theory. What did we do till now? We came up with this neuron thing. We said it works very well: if I put neurons in a network, I can approximate any function I want. I then have to design this loss function, which has all this weird stuff in it, and then I'm going to minimize it and find the right weights. You need all of that to actually work with these things. But let's try to get some intuition for why they actually work. One way to get intuition is to pick a very simple problem — if you can't solve a simple problem, you can't solve a hard problem. So the very simple problem we pick is this: you have two regions, the red dots and the green dots, and every point has an x and a y coordinate.
Each point is just two numbers. Anything in the center is class zero; anything in the green ring is class one. And I want to come up with a way to take the x, y coordinates of a point — you give me an x and y from this plane — and tell you whether it's red or green; red is zero, green is one. Of course, a human being will look at this and say: that's easy. Look at the radius from the center — if it's less than, I don't know, one, you're red; if it's more than one, you're green. Done. I don't need all this stuff. But what about an algorithm? I could write that logic myself: take x squared plus y squared, or the square root of x squared plus y squared, and compare it to a threshold. But I don't want a hard-coded threshold; generally, I want to learn it. And so — again, we won't go too much into it — there's a very nice technique for these kinds of classification problems called logistic regression, and it does something very simple. It takes the input data, x and y, computes a linear combination ax plus by plus c, feeds it to the sigmoid function, and predicts a probability: you give me a point, I'll tell you the probability that it's class zero or one. These are familiar things — we just looked at sigmoids, and this is the same log-likelihood loss we worked through, which it tries to minimize — but it doesn't have the hidden layers and all that. It just takes the numbers, feeds them to a sigmoid, and predicts a probability. This is a sigmoid again, and you can see why you can treat its output as a probability: a sigmoid gives you a number between zero and one, and that's a probability for our purposes. And we'll make a simple decision: anything with more than fifty percent probability is class one, anything less is class zero. Let's keep life simple. And so the question is: where is that boundary?
Where do I actually make the decision that something is in class one versus class zero — where does it cross that fifty percent boundary? The answer, for a sigmoid, is simple: a sigmoid is 0.5 exactly when its input argument is zero. It will predict a probability of fifty percent when its argument is exactly zero. And when is that argument zero? In our case we have two inputs, x and y, so when some number times x, plus some number times y, plus c is zero, I get a fifty percent probability. If you have built logistic regression models, that's called a decision boundary, and ax plus by plus c equals zero is geometrically something very nice: it's a straight line. So if I fit a logistic regression model to this data, I get that straight line: it says anything below the line is green, anything above the line is red. And this is horrible — it's fifty percent accurate; all the yellow stuff is the points it got wrong. So clearly this doesn't work. In other words, logistic regression is used when you can separate your data with a line, or a plane, or a high-dimensional plane; here, of course, it doesn't work. As we said before, you could of course pick the radius by hand — that would work — but this is two-dimensional data. What if I have 10,000-dimensional data? I can't see it; we can't visualize even four dimensions. So of course you want something more automated. So we'll use our neural network. We'll minimize the negative log-likelihood — also called the cross-entropy — that complex-looking loss function, and we'll predict a probability. Let's see what happens. What do we expect will happen? What we can do is just stare at this for some time and say: okay, I have x, I have y; I compute something that looks like w1 x plus w2 y plus b, and I feed that to a sigmoid.
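The "the boundary is a straight line" claim can be checked numerically. The weights below are made-up numbers, just to illustrate that any point on the line w1·x + w2·y + b = 0 gets exactly 0.5:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for one node: sigmoid(w1*x + w2*y + b).
w1, w2, b = 2.0, -1.0, 0.5

def node(x, y):
    return sigmoid(w1 * x + w2 * y + b)

# Any point on the line w1*x + w2*y + b = 0 makes the sigmoid's argument zero,
# so the node outputs exactly 0.5 there: the 50% boundary is that straight line.
x = 1.0
y_on_line = -(w1 * x + b) / w2  # solve the line equation for y
print(node(x, y_on_line))  # 0.5
```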
We just looked at this: I know that number is fifty percent when the argument is zero — when I am on a straight line. The same thing happens for the second neuron: it has its own decision boundary. So every neuron is drawing its own line. The first neuron says: I'm going to draw my line. The second one says: I don't like yours, I'll draw my own, different line. The third one draws its own. So you end up with three lines, one for each hidden node. So what if I have one hidden node? Well, it draws one line — that blue line — and says everything on one side is green and everything on the other is red. I can add two nodes — these are the numbers of nodes in the hidden layer, and I keep increasing them. If I add two, I get two lines. Okay: everything to the top left and the bottom right is correct — it's all green there — and in the center all the red is correct. But what it's doing is saying anything between those two lines is red, so of course it gets those two yellow regions completely wrong. Draw a third line, and now it gets everything right, because the three lines — I'll use this word loosely — cooperate: they isolate that red region by drawing three lines around it. Anything between us is red; anything not between us is green. So one line didn't work, two lines worked better, three lines is perfect. I can add five nodes, so I get five lines; of course, it's not useful at this stage, they're just trying to box in that red region. I can add ten — same thing. Beyond a point it doesn't really matter: you have perfect performance, and adding more nodes really does nothing. So what's going on here? What happens when you draw one line? Remember, we said that line is the line where the probability is fifty percent. That line says: if you are on me, I output fifty percent, 0.5. And as you go to the bottom left, that fifty percent goes to a hundred percent.
So that's what the arrow shows: the node's output goes to one in that direction, and if you walk in the other direction its value goes to zero. That one node cares about nothing else: if you're on my line, I give you a score of 0.5; walk one way and I'll give you zero; walk the other way and I'll give you one. In this case you have two lines, and each line is doing the same thing. The orange line says: 0.5 on me, zero in that direction, one in that direction. And so we introduce a new coordinate system. We say everything above the first line is c equals one, and everything below it is c equals zero. Of course, the other line says: I have my own coordinate — everything above me is d equals zero, everything below me is d equals one. So each line divides the whole region into two pieces and assigns labels of zero or one to each piece. And you can combine these. In the top left region, c is one and d is zero: one line said you're one, one line said you're zero. In the center, both lines said you're zero. Below that, one said you're zero and one said you're one. And this can be thought of as a different coordinate system: the nodes are basically saying, I don't like x and y; I'll work in the c and d space — that's the one I like. One of the regions happens to be mixed — it has both red and green points — but two of them are pure. So what's going on is that each hidden node puts down a line and divides the whole input space into two regions. So if I put in n hidden nodes, I get up to two to the n regions — an exponential number — and if I divide the input into enough regions, I'll get pure segments: some purely green, some purely red. And so the idea is: keep increasing n.
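The counting argument can be sketched numerically: each hidden node contributes one bit (which side of its line you're on), so n nodes label points with at most 2ⁿ distinct codes. The lines here are random stand-ins for learned weights; note that not every code need actually appear (three lines in general position carve the plane into only 7 regions):

```python
import numpy as np

rng = np.random.default_rng(1)

# n hidden nodes = n lines a*x + b*y + c = 0 drawn in the plane.
n = 3
A = rng.normal(size=(n, 2))  # one (a, b) pair per line
c = rng.normal(size=n)

def region_code(point):
    # One bit per node: which side of that node's line is the point on?
    return tuple(int(s) for s in (A @ point + c > 0))

points = rng.normal(size=(2000, 2))
codes = {region_code(p) for p in points}
print(len(codes), "distinct codes out of at most", 2 ** n)
```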
So if you give me an image classification problem, maybe I'll try n equals 10, which gives 1,024 regions, and that might not work — so I keep increasing it, and eventually I'll do well. Another way to look at this: when I feed in the x and y coordinates, each hidden node gives me zero or one, so I get a tuple of n coordinates, each one being zero or one. The point x, y gets mapped to these n binary numbers. And what does that look like geometrically? One number between zero and one is an interval. Two numbers between zero and one form a square in two dimensions — you can plot it. Three numbers form a cube. n numbers form what's called a hypercube: you can't visualize it, but it's the n-dimensional cube. So you're doing a coordinate transformation. You're saying: x and y is not what I want to work with; I want to work with these new coordinates. And that's what this looks like. On the left side you have the original data, the red and the green points; the neural network transforms that into the c and d space. And remember, the outputs were all zeros or ones, so the points fall along the edges of the square — everything here lies on two edges of the square. If you look closely, you'll see that the region in the center — this part — has both green and red points: this red stuff gets mapped to this red circle, the stuff on the top gets mapped to this green ellipse here, and the stuff on the bottom gets mapped to this ellipse here. So the neural network is taking your data and saying: I don't like it the way it is; I'm going to transform it, put it in a new coordinate system. And then remember, we always have one more neuron at the end, and that acts like logistic regression. It says: okay, I'll forget about the original data.
I'll work with this new data you gave me on the right. Can I draw a line to separate the reds and the greens? So it makes this feeble attempt: it draws this line right here and says anything above the line is class zero, anything below is class one. Clearly that doesn't work very well in this case. But remember, when we added three lines we had perfect predictions: those three lines took the red and boxed it in. What that means is that in three dimensions you get a cube, and all the green stuff gets thrown onto the edges — it spreads across the cube — but the red stuff gets pushed into one corner. So the neural network is basically saying: I like this coordinate system, because here I can put all the red stuff in one corner and all the green stuff everywhere else. And then the output neuron says: since this is three-dimensional, can I draw a plane instead of a line to isolate the red? And it can: it draws a plane, cuts the red stuff off, and the plane says anything below me is red, anything above me is green. So the takeaway here is that a neural network can be thought of as something that transforms your data into a different space, so that something simple like logistic regression can separate it. Now, it's not technically logistic regression — it's a different loss function — but at its core, that's what it's doing. Now, that was a very simple classification example. The takeaway: there are two ways to think about the number of hidden nodes when you have one hidden layer. It's the number of regions you divide your input space into; or, another way of saying the same thing, you take your input data and transform it into an n-dimensional space, in the hope of easily separating the two classes.
That's what it's doing — except we don't have to tell it how to do the mapping. It learns that; it tweaks the weights to learn the mapping. The other problem that's often seen is regression: predict a real number. So again, pick the simplest possible example: sine x. If I give you x, can you tell me what sine x is? Of course, you can compute this exactly, but we want to see whether the method works on it — if it doesn't work on this, it's a bad technique; who cares. And so again, same architecture. The only thing that changes is at the output: instead of all that fancy log stuff, we just use mean squared error — the value I predict, the actual value, subtract, square — and let's see what happens. Again, you want to first look at the network and ask what it would do. In this case we have only one input: just x. So what does a hidden node do? It takes x, multiplies it by a number w1, adds a bias, and feeds it to a sigmoid function. The decision boundary, remember, is where this gives 0.5 — where it changes from below 0.5 to above 0.5 — and in this case that's where the expression inside the brackets is zero. When that is zero, I can solve exactly for x: x is minus the bias over the weight. So what that means is: when I'm to the left of that value — when my x is less than it — I get zero; as soon as I cross it, I get one. That's for the top hidden node. The next one says: I'll do the same thing, but with a different weight and a different bias, so it comes up with its own threshold x. And the third one does the same; each one says, if you're to the left of my threshold I give you zero, and if you're to the right I give you one. So what does that look like? Something like this. In this case you have four hidden nodes, and each one has a value, which is plotted here.
So this is the value for the third node, the value for the first one, and so on. Each one, when x crosses its boundary, goes from on to off or off to on — from zero to one, or one to zero. So here's an example — I just made up something. These are the four boundaries. What this vector says is which nodes are on and which are off. So this means node one is on, and nodes two and three are off — those are the zeros. When I cross this boundary — it's delta three, the one for the third node — the third node goes from off to on, from zero to one. When I cross the next boundary, which is the one for node one — that node is already on, so it turns off: it goes from one to zero. This one is for the fourth node: the fourth one turns on, and as soon as you cross its boundary again it becomes zero. And this is the second one, so the second node goes from off to on, zero to one. So the idea is: if I have four hidden nodes, I get four threshold values on my x-axis, and each time I cross one of them, one node switches sign — it goes from zero to one or one to zero. And what happens for one configuration? In this case the four nodes read one, zero, zero, one. Remember, at the output each node's value gets multiplied by a weight — u1, u2, u3, u4 — and a bias d gets added. So if I do that, I get u1 times one, plus u2 times zero — which is zero — plus u3 times zero — zero — plus u4 times one. I add up all the ones that are on: I get u1 plus u4 plus d, some number. So when I'm at the leftmost side and I start like this, I basically add the corresponding weights: u1 plus u4 plus d. Then I switch on the third neuron, and when I switch it on, it starts making a contribution: it takes the old value but adds in u3.
So you see u1, u4, d, and a u3. Then I turn the first one off; when I turn it off, that term disappears, so u1 completely disappears from here. And the same thing repeats: u4 disappears, and u2 shows up. So in other words, if I have four hidden neurons, I get five distinct values that my neural network can predict. These are the five values, and its job is to figure out what the u's and the d are so that it can approximate the function I'm trying to approximate. The general takeaway: four hidden neurons give you five regions; n hidden neurons give you n + 1 regions. And this is what it looks like. So remember, x-axis, y = sin(x). I have two nodes in the hidden layer, so I get three distinct values. You see the green line, which is the neurons turning on. As you go from the leftmost side, neuron number one turns on, then neuron number two turns on, and the orange curve is the actual prediction. You can actually see that each time something turns on, the prediction increases in value. When you go from the leftmost region to the middle region you see a bump, and when you go again, you see a bump. Now of course you would look at this and say this is horrible: that's not sine. The yellow curve and the green curve, there's nothing sine-like about them. It doesn't work. As a technicality, you see this S-shaped thing: that's the sigmoid. Each time a neuron turns on, you don't discretely jump from zero to one, you go along the sigmoid. So, okay, same game. Let me increase the number of neurons. I put three; when I do three, I get four regions, right? There are four horizontal green segments at different levels. It's okay: towards the right it's fitting a bit, towards the left it's not. Four neurons, so five regions: it's getting better. But with ten, it does it perfectly. And if you didn't have the sigmoid the way it is, if you didn't have those S-shaped pieces, it of course wouldn't do this well; it would just be the green line.
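To make that "n hidden neurons give n + 1 regions" picture concrete, here is a minimal NumPy sketch. All the weights and biases below are made up for illustration, and a hard 0/1 threshold stands in for the sigmoid, so the plateaus are exact:

```python
import numpy as np

# Toy one-hidden-layer regression network with hard thresholds instead of
# sigmoids. Four hidden neurons -> at most five distinct output values.
# All weights and biases here are made up for illustration.
w = np.array([2.0, -1.0, 0.5, 1.5])   # input -> hidden weights
b = np.array([-2.0, 3.0, 1.0, -6.0])  # hidden biases
u = np.array([0.7, -0.3, 0.9, 0.4])   # hidden -> output weights u1..u4
d = 0.1                                # output bias

def predict(x):
    h = (w * x + b > 0).astype(float)  # each neuron: 0 on one side of its threshold, 1 on the other
    return u @ h + d                   # weighted sum of the "on" neurons plus the bias

thresholds = -b / w                    # each neuron flips exactly at x = -bias/weight
xs = np.linspace(-10, 10, 2001)
ys = np.array([predict(x) for x in xs])
print("thresholds:", np.sort(thresholds))
print("distinct output values:", np.unique(np.round(ys, 6)))
```

Scanning x across the four thresholds flips one neuron at a time, so the printed output contains exactly five distinct plateau values, just as in the talk's figure.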
That's not very good, but the sigmoid helps; it helps to smoothly turn things on and off. So you say, okay, maybe that was too easy. And I should add: a neural network is dumb. A human being, if I say sin(x), will say: I know it's periodic, I know it's defined over all real numbers, it goes off in both directions and it's periodic everywhere. That extra information is not given to the neural network. So if I give it a value of 15,000 and say, tell me sin(x), it will give me a number, and it won't make any sense. So you basically focus on the region of input that's given while training; that's where you want to operate, you don't want to operate outside that. And that's an open research problem, too. If you had a neural network and you wanted to give it that information, you wanted to say: there's a symmetry, there's something more in the data, which is that when I increase the argument of sin(x) by 2π I get the same value, it's periodic. How would you do that? But in this case, let's pick a slightly harder problem: sin(x) times a decaying exponential. So if you look at it, it decreases as it goes from left to right, and it gets narrower. Can I predict this? Two nodes, three regions: no. Three nodes, four regions: no. What about ten? Still not, and you see something interesting: it tries to fit the stuff on the left. Because the values on the left are higher, the losses there are higher, so it tries to minimize the loss where it can. It says: if I solve this part, my loss decreases dramatically. The part on the right is so small that if I get it wrong, my error is small; I don't care. Yes? Because it's trying to get the two... Yeah, so the question is: what's happening in the middle, right?
So you have the orange curve: it works on the left, it seems to work in the middle, and then you have that decrease, but it's not really fitting that well there. The answer to that is: when you do the minimization, as we'll see, it doesn't actually find the minimum of the loss function; you terminate it somewhere. So if I let it run for more time, it will try to fit that piece better. But what you do in practice is, at some stage you say: run n iterations and stop. So if I let this simulation run for, I don't know, 10 more steps or 10,000 more steps, it would keep trying to make the predictions in the center better and forget about the part on the right. And that's a good point, actually; that happens in all neural network problems. You don't know when to stop the training. It will keep trying to minimize the loss, and at some stage the loss curve starts flattening, and you ask: do I want to keep going? Because sometimes you do see a dramatic reduction later, but often you just say, okay, I'm happy with this, done. And if I put in enough hidden nodes, then with a hundred I get a hundred and one regions, and even the green curve doesn't look like straight horizontal lines anymore. There are so many regions, it almost looks like there's some structure there, but this is just the network putting in as many regions as it can and trying to fit a value in each part. And I would like to add that for both these examples, this is not what you would do in practice. In practice you don't say: let me keep increasing the complexity of my neural network until I fit the data. That is called overfitting.
It's called memorizing your dataset. What you do is take your data, split it into two chunks, and do all of this on only one chunk. When it's done, you make predictions on the second chunk and see how well you did there. Overfitting is when you did extremely well on the training data, and then you apply the model to the other chunk and your performance drops like a rock. Then you know you did something wrong: you memorized the dataset. So you want to do it in a way that generalizes: when you give me some other data, I can still make good predictions. This is just one way to understand it. So ideally, the way I would do the regression one is maybe randomly sample, or, to make it harder, I would truncate the x-axis region into two disconnected parts, train only on the left one, and see if it can extrapolate to the right. And it won't do very well in this case. But again, the idea is that a neural network is basically doing these simple operations at its core, and we have the information about how each piece operates and how it puts them together. That's what basically makes it tick; that's what makes it work. It's much harder, or even impossible, to do this kind of interpretation when you have real, complex datasets. So a big problem in deep learning in general is: you train this network, your performance is outstanding even on new datasets, and then the question is, why did you predict this? I want to understand that. So there's a key distinction between prediction and understanding. If you look at all of science, a typical scientist doesn't care about predictions as such; that's not what's interesting. Prediction is a means to get to real understanding: if I can predict, and study what model I used to predict, maybe I can understand what's going on in the system. But prediction for its own sake is mostly not that useful, except for small problems.
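The train/test split just described can be sketched in a few lines. This uses a polynomial fit as a hypothetical stand-in for a neural network, with the polynomial degree playing the role of model complexity; the dataset is made up (noisy samples of sin(x)):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dataset: noisy samples of sin(x).
x = rng.uniform(-3, 3, 60)
y = np.sin(x) + rng.normal(0, 0.1, x.size)

# Hold out a chunk: train on one part, evaluate on the other.
split = 40
x_tr, y_tr = x[:split], y[:split]
x_te, y_te = x[split:], y[split:]

def mse(p, xs, ys):
    # Mean squared error of polynomial p on a dataset.
    return np.mean((np.polyval(p, xs) - ys) ** 2)

# Increasing the degree always drives the TRAINING error down, but a high
# degree can start chasing the noise (overfitting): watch the test error.
for degree in (1, 3, 15):
    p = np.polyfit(x_tr, y_tr, degree)
    print(degree, "train MSE:", mse(p, x_tr, y_tr), "test MSE:", mse(p, x_te, y_te))
```

The point of the held-out chunk is exactly the one from the talk: training error alone always looks better as the model gets more complex, so only the test error tells you whether you memorized the data.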
I mean, if my apartment building wants to predict when I'll be home and automatically switch on the heater, maybe I don't care. But if it's a self-driving car and it's going to make a left when there's a stop sign, I want to understand exactly how it came to that decision. Why did it do that? And so, apart from some problems, it's an open problem to figure out what a neural network is actually doing. Phew, long, right? So, any questions here, anything else? Yes? Yes, so that's a very good question. The question is: we first looked at the universal approximation theorem and said we can approximate any reasonable function to arbitrary accuracy, but we pay a price, which is: if I want the accuracy to go up, if I want that epsilon to go down, then I need more and more neurons, exponentially more neurons. And you saw that: in the regression problem I had to go to a hundred. Imagine something more complex; it might be a million. And sure, I can always keep scanning, or do something like binary search, keep increasing and decreasing the number of neurons and see what happens. So why do people add more hidden layers? Why does that work? Does that help with the universal approximation theorem? By adding more hidden layers, can I change that exponential behavior to something more well-behaved? And the answer to that is: I don't know, and that's an open problem in general. There are, depending on the problem, heuristics you can come up with.
You can say, okay, this is what the mappings are. So I'll give you an example: for image recognition you use something similar, but with some tweaks, called a convolutional neural network. And there, what the layers do is learn larger and larger features. What I mean is: if I give it images of cats and dogs and, I don't know, elephants, the first hidden layer will learn something very coarse, like straight lines and curves. The second hidden layer starts learning more complex things, like eyes and noses and ears. The third hidden layer starts recognizing faces, and the fourth one might actually start recognizing whole animals. So that's what depth does, and it does help. It does seem to tame that exponential behavior: you don't need exponentially many neurons anymore, you can just keep adding layers. But it comes at a cost: training becomes much harder. Because, as we will see, when you do training you have to compute these first derivatives, and when you're doing it backwards you get all kinds of issues, because the derivatives towards the beginning of the network are very small while the ones towards the end are reasonable, and that slows down training towards the start of the network. There are all these small technical things. But there's a guy at Princeton University, for example, not my relative: his last name is Arora, his first name is Sanjeev, no connection at all, but I sometimes get credit for his work because of the similar name. He's doing a lot of work on mathematically trying to understand what depth does: does depth add something to the approximation theorem, basically. But yeah, the short answer is, I don't know; I don't think anyone really has a tight bound on that. So let's see, how are we doing on time? Doing all right. Okay. So we came up with a neural network.
We saw what it does intuitively. We said we now want to find the weights, the matrices and the biases. So at this stage, let's say we're convinced that this works well, and you ask: how do I actually find these weights and biases? I want an efficient way of doing this. Why do I want to find them? Because I want to minimize my loss function. So let's first look at minimizing a function. That field in general is called optimization, or mathematical optimization, and the idea is simple. You have some function with an input x, some value f(x), and it's some squiggly thing, and I want to find the x where this function takes the smallest possible value. So that's the minimum; that's the maximum. Some terminology: if you take any function and find the absolute maximum it can ever take, anywhere in the function's domain, that's called the global maximum. It's global. But if I focus on a small region of the function, I might still see a maximum, a peak; that's called a local maximum. A local maximum in general is not the global maximum, and so the eventual problem is to find the global maximum or the global minimum. But sometimes your function is too complex and you have to make do with the local one. And the same definitions apply for minima. And that's what it would look like: if you were to stand very close to the screen, maybe all you would see is the region in the center, and you would say, oh yes, that's my maximum, that's my minimum. But it's local, because you don't know what exists outside. So you step back, you zoom out, and when you zoom out you say: oh no, that's the global maximum, that's the global minimum. Can I try to find the global minimum? I'm trying to minimize my loss; can I find the absolute lowest value that my loss function can take? So that's training, right? Learning, training, is basically: how do I go about
minimizing this function, and what knobs do I have? All I can really do is change the V and W matrices, all the weights and biases. Every neuron has these weights coming in, and every neuron has its bias. Can I change these values in some smart way to minimize the loss? That whole process is training. So all this stuff about GPUs and training on large clusters is eventually about answering that question. You also use optimization for something called hyperparameter tuning, which is a fancy way of saying: training only helps me with the weights and biases, but it doesn't tell me how many hidden nodes I should put in, and it doesn't tell me what activation function I should use. Those decisions are made by basically putting everything in a loop: try this, try this, try this. And smart optimization techniques help with that; they say you don't have to do a raw loop, you can do something better. So optimization by itself is a fascinating field, and it's not restricted to deep learning. We only look at a tiny slice of it, but optimization is used everywhere. FedEx uses it. FedEx says: I have all these packages I have to deliver to all these people, these are the pairwise distances between everyone, and maybe I want to minimize the time spent, or the tolls I pay, or the fuel used. Finding the solution to that is an optimization problem. So what are the restrictions we might have? The function we're trying to minimize might be too expensive to evaluate, and if it's too expensive to evaluate, I don't want to evaluate it too many times; that constrains my technique. The other problem might be that f is discrete. The traveling salesman problem is a discrete problem: I either go to points one, two and then three, or I go to three, one and two. I have six permutations, but there's no notion of smoothness, nothing I can smoothly change;
I have to just swap two cities. And if something is discrete, I can't take derivatives, and if I can't take derivatives, I can't use a whole swath of techniques. That's a restriction. Thankfully, that's not our restriction; everything is continuous here. A third restriction might be that while you can evaluate your function fine, the first derivative might be hard to evaluate, or maybe you can evaluate the first derivative but the second derivative is hard to evaluate. So there are techniques that use these higher derivatives that I then can't use. And in neural networks you generally use only the first derivative; it's too expensive to evaluate the second derivative, as we'll see. Then another problem is that the function you're looking at might be squiggly. If it's a nice bowl, I can solve it, but if it has many local minima, many of these things called saddle points, a lot of structure to it, then in general you can't find the global minimum. There's no technique for that. You try the smartest strategies, but there's nothing that guarantees you the solution, except running forever. And the last one is that the function might be high-dimensional. We plotted a one-dimensional function; you can visualize it. What if it's a function of, you know, 10,000 variables? I can't visualize it. How do I build techniques that work on that data? So the method that deep learning uses in a dominant way is gradient descent, in various versions. There are tweaks to it which work better in practice, but this is at the core, and gradient descent is very, very simple. It says: I see this bowl and I want to get to the yellow star, that's the minimum. If I am to the left here, what do I do? I look at my slope, and I say the slope here is negative, it's pointing down. If my slope is negative,
I'm going to move to the right. If I'm to the right, where the slope is positive, I move to the left. So it's like being on a hill: if you're on a hill and the ground slopes down ahead of you, you move forward; if you look ahead and it slopes up, you turn around and move backwards. And that's what this is. You're standing on a hill: if you're looking down, you move in that direction; if you're looking up, you turn around and move down. That's it. So can I come up with an iterative method? Yes. This x here with the superscript t means: this is my value, my position, at time t. So I want to come up with an update equation: start at some position, update it, and get the next position. And the question is: what's the update? Well, we just saw that the update is opposite to the sign of the slope. If the slope is positive, I move in the negative direction; if it's negative, I move in the positive direction. So the update should be proportional to my slope, the first derivative, and it should have a negative sign: x(t+1) = x(t) - eta * f'(x(t)). That's all. And the eta is a parameter that controls how big your step size is: do I take tiny steps, do I take huge steps? It's also often called the learning rate. So if you're using Keras or TensorFlow or something and you see "learning rate", it's basically that parameter that tells me how fast I should move when I minimize a function. And we're running out of time, so I'll show you a few examples. This is, you know, a simple simulation; it's a simple example in Python. It says: start here, the learning rate is 0.1, the start point is 5, and I just run that update. I calculate the slope and move in the direction opposite the slope, and in this case it nicely converges to the minimum. Great, right? But then I increased the learning rate. I said the learning rate is too small, let me make it large.
It's 1.1 in this case. So what happens is: at the start point it says, I need to move to the right, but it takes this huge step and actually gets worse; the value increases. Then it says: oh, I need to move to the left, I moved too much to the right. It takes another big step, but the function value increases again. And it keeps taking these big steps, and because the steps are so big, it can't take the tiny steps it needs; it just keeps going in the wrong direction and starts increasing the value of the function instead of decreasing it. So this is an example of why some things in deep learning are so hacky. What learning rate should I pick? If I pick the wrong one, I'm actually not minimizing my loss, I'm maximizing my loss, and I don't want to maximize my loss. This is an intermediate example: instead of a very small learning rate or a very large one, I pick something in the middle, and you can see it's still a bit large, so it jumps to the other side of the bowl, but it does converge to the minimum eventually. Of course, in real life you don't have nice functions like that; you have something more complex. So here you have two bowls. The one on the right is the global minimum.
The one on the left is the local minimum; it's a bit higher. And in this case I start just a tiny bit to the left of the symmetric center point. When the learning rate is small, it converges, but to the wrong point: I wanted to go to the right, but it's sensitive to where you start, so it goes to the left. If I increase the learning rate, it starts bouncing around; it bounces, and it's very hard to see from a distance, but it ends up here, not even at the local minimum. If I move the start point a bit to the right, then I get the right answer: it goes to the global minimum. But again, if I increase the learning rate, it actually hops out of the global-minimum bowl and goes to the local minimum. This happens all the time in deep learning: your cost function is very, very messy, it has so many local minima, and when you start doing gradient descent, it jumps all over the place and often ends up at a local minimum, not a global one. So a lot of the tricks you see, the tuning and all this stuff in deep learning, is about trying to solve this problem, and this in general is another big area of research: how do we train neural networks faster? People also do things like half precision, or even less, for the floating-point numbers just to speed all this up: I don't really care that much about floating-point precision, I just want this gradient descent to run fast. So the two key takeaways here are: gradient descent is sensitive to the learning rate, and it is sensitive to the start point. Except we're not working in one dimension. You have the V matrix, the W matrix, the weights; these are usually in the hundreds of millions for large neural networks, or even billions. So you have billions of parameters, plus the biases for all the nodes, and you are working in this ten-million, hundred-million, billion-parameter space.
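The two takeaways can be sketched in a few lines. The first part mirrors the talk's simulation on the simple bowl f(x) = x^2 (start at x = 5, learning rates 0.1 and 1.1); the double-well function is made up for illustration (it is not necessarily the exact function on the slides), with a global minimum near x = +1 and a slightly higher local minimum near x = -1:

```python
# Gradient descent sketches for the two takeaways: sensitivity to the
# learning rate (on a simple bowl) and to the start point (on a double well).
def descend(slope, eta, x0, steps=500):
    x = x0
    for _ in range(steps):
        x = x - eta * slope(x)   # x(t+1) = x(t) - eta * f'(x(t))
    return x

bowl = lambda x: 2 * x                   # f'(x) for f(x) = x^2, minimum at 0
well = lambda x: 4 * x**3 - 4 * x - 0.3  # g'(x) for g(x) = x^4 - 2x^2 - 0.3x

print(descend(bowl, 0.1, 5.0))    # small rate: converges near the minimum at 0
print(descend(bowl, 1.1, 5.0))    # too large: every step overshoots, x blows up
print(descend(well, 0.01, -0.1))  # start left of the barrier: left (local) minimum
print(descend(well, 0.01, 0.1))   # start right of the barrier: right (global) minimum
```

The same update rule, run from two start points only 0.2 apart, lands in two different bowls, which is the start-point sensitivity the talk describes.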
You're doing this search in that huge space; it's very hard to find even a reasonable local minimum. And I'll skip over some of this. The slides will be available, so please feel free to take a look and ping me; I'm happy to explain this stuff. But for very simple problems you can do some mathematical analysis, which you generally cannot: when you have a bowl, you can actually ask how long it takes to converge, and you can see that in this case it goes as one over the learning rate. So if I make the learning rate very small, my time increases; if I make it too large, of course, I bounce out, and then this analysis doesn't apply. So the key thing is: can I tune the learning rate so that I don't bounce out, but I also don't take too long to converge? And then people have modifications of gradient descent where you say: okay, I want to use prior information about where I was to decide where I should go, I want the learning rate to change over time. That itself is a hacky thing, but there are many tricks of the trade. I'll stop there. The last thing I'll add: if you have ever heard of backpropagation, backpropagation is basically a technique to find all the first derivatives. Gradient descent needs the first derivative of the function with respect to what you are tuning, so we need the first derivative of the loss function with respect to all the weights. And to find that in an efficient way, you use something called backpropagation. It means you go forward in your neural network and then you go backwards, and each time you go backwards, you compute the derivatives; it's a very efficient way of doing that. So you can look at the backpropagation section later, too; it's on some simple examples. But I think that's it from me for today, so we have ten minutes for questions, right?
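A minimal backpropagation sketch (the sizes, learning rate and iteration count here are made up): a one-hidden-layer network trained on samples of sin(x), where the backward pass is just the chain rule applied by hand, layer by layer, from the loss back to every weight:

```python
import numpy as np

# One-hidden-layer network y = u . sigmoid(w*x + b) + d, trained with
# per-sample gradient descent on samples of sin(x). The backward pass
# computes d(loss)/d(parameter) for every weight and bias via the chain rule.
rng = np.random.default_rng(1)
xs = np.linspace(-3, 3, 50)
ys = np.sin(xs)

H = 10                                     # hidden nodes (made-up size)
w = rng.normal(size=H); b = rng.normal(size=H)
u = rng.normal(size=H); d = 0.0
eta = 0.02                                 # learning rate (made-up value)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss():
    # Mean squared error over the whole dataset.
    return np.mean((u @ sigmoid(np.outer(w, xs) + b[:, None]) + d - ys) ** 2)

before = loss()
for _ in range(1000):                      # epochs
    for x, y in zip(xs, ys):
        # Forward pass.
        z = w * x + b; h = sigmoid(z); pred = u @ h + d
        # Backward pass (chain rule) for the squared error (pred - y)^2.
        g = 2 * (pred - y)                 # d(loss)/d(pred)
        gu = g * h; gd = g                 # output-layer gradients
        gh = g * u                         # back through the dot product
        gz = gh * h * (1 - h)              # back through the sigmoid
        gw = gz * x; gb = gz               # input-layer gradients
        # Gradient-descent step on every parameter.
        u -= eta * gu; d -= eta * gd; w -= eta * gw; b -= eta * gb
print("loss:", before, "->", loss())
```

Note how each backward step only reuses quantities already computed on the way forward (h, pred); that reuse is what makes backpropagation efficient.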
Yeah, yes. So the question is: neural networks for image classification have this property that I can give one an image of a cat, which it was trained to recognize as a cat, and I go and change a few pixels, maybe five pixels, one pixel, very few pixels by small amounts, so that to a human being it still is a cat; you cannot detect the difference. But you can actually change those pixels in a smart way so that the cat looks, to the network, like any other category you want. Say the network learned to predict cars: you can change the image so that it says, this is a car. And you can imagine the risk of that with self-driving cars: you take a stop sign, you do something to it, and it says, oh, yield, shoot past. And I think again the global answer is: people have heuristics for why that happens, and they'll say things like, well, we are learning these high-dimensional manifolds and, you know, their jaggedness, things like that, which don't really explain the problem away. So we don't know, and that's a big problem in neural networks, this problem of generalization: can I learn something that is as robust as human beings? There's a trick people use to mitigate it, which is: while training, instead of giving just the image of a cat, I apply these modifications, so it looks the same, and feed that to the neural network too, and it becomes more robust. But in general, no one knows. So what Diane just mentioned, this is the biggest problem, which is why you should not use these neural networks blindly. For instance, I don't know anyone who would actually entrust something like trading systems to some neural network, because what Sanjay described so far is the very simplest feed-forward network, with forward and backward propagation, and in this simplest form no one is using it for any real purpose. So he mentioned things like CNNs, convolutional neural networks, or RNNs, etc.
These kinds of things exist, they are in use today, and they are much more complicated than this, the math behind them too. So everyone who knows something like the Fourier transform knows this concept of convolution; in a convolutional neural network that gets introduced, which allows us to recognize features regardless of where they are. But all of a sudden you have this complexity in the formula, and then exactly the problem Diane mentioned happens: if you change something only slightly, you can get a completely different output. The outputs are not linear in any way or form with respect to the numerical values of the input. So you really have to have some checks in place if you make use of these kinds of things for anything serious. If someone like Google is using this for search results, the worst thing that can happen is some nonsense pops up. Everyone has experienced things like Google Translate doing things like this; what's behind the scenes is neural networks, and if it spits out nonsense for the translations, fine. But do you trust your life to this? And generally I would say the approach is that of security in general: if I have a function and you feed the inputs, then I want to make it robust. In many applications in practice it's not that sensitive, or the user doesn't really have the flexibility to choose the input, so in those cases they're still used heavily. But as soon as you have something like a self-driving car, where someone can manipulate the input, then it gets dangerous. And one other thing I want to mention, since we didn't have time to do these kinds of things here: if you go to my GitHub page, there's a tiny little project there, something like nn-bp; you can check it out.
That's a visualization of neural networks. So it actually shows you something. I think you can specify, when you recompile it, how many layers it has and how many nodes it has. You should use few of them, because I think with two or three it actually works. It shows you in a visual way, with colors and all these kinds of things, how adding additional nodes works and how the different layers work, what they contribute to recognizing certain features. So it's something that can help you get a better grip on this, because as soon as you leave the one-hidden-layer model, which, as he showed, already gives you many, many intersecting lines, it gets complicated. Just imagine what happens when you get multiple of these layers in there; it gets really, really complicated to understand. And as Sanjay said, even those who are actually researching this have not really gotten a grasp of what it actually means and what actually happens at large scale when this is applied. And what I would add to this is: if you're interested in this stuff and you want to experiment, there are various levels at which you can enter the field. Like I said, you don't have to go and learn all the mathematics first and then work backwards. It's often useful to just go to something like Kaggle, get a dataset (they have a lot of image datasets), and just start training a network. That, I think, is the fastest introduction. The mathematics is important, and eventually you do need knowledge of all this, but initially, just getting something running, trying experiments, tweaking parameters and seeing what happens is the fastest way to get familiar with all this stuff. So that's normally the quickest entry point. Of course you can also go and look at the theoretical work, or work on the infrastructure too. There are great libraries; PyTorch, it's from the NYU and Facebook guys.
Personally, I think it's much better than TensorFlow, especially for research, so I would say try that, or try TensorFlow, or, you know, make pull requests or something; that also helps you really learn the guts of how these things are implemented. And it's completely up to you where you start. And one last thing I want to add: Sanjay mentioned at the beginning that we are using GPUs for these kinds of things. GPUs, just like many other things which came before them, are good at linear algebra, and as we have seen, this is pure linear algebra; there's nothing else in there. So normally the representation you saw here takes the form of matrix-vector multiplication, and if you do it with slightly more complicated things, you actually end up with tensor operations. And guess what: companies like Google have implemented their own hardware for these kinds of things. GPUs got better over the years, but they're not the most specialized hardware for this. They are, however, much more specialized for it than a normal CPU, which even today does not perform more than, let's say, 64 operations in one single instruction, whereas a GPU can do something like 1,500 operations in one single instruction. And that's the big difference; that's where the performance benefit comes from. All right, so any other questions? So you're the tough ones; I saw about 20% sneak out. Thank you for coming. I think... oh yeah, sure. [Audience] What are the new milestones in AI, like playing games? How do you achieve that? In team games especially, you have, like, five, how to say it, independent networks which cooperate with each other, or does it work differently? How do you actually achieve that?
Sorry? So, as far as I know, and I'm not an expert in this at all: things like AlphaGo, and I don't know about the latest StarCraft one, often have two networks playing games against each other, and that's a way of generating more data. I don't know if that's your question, but you might have 10,000 games of chess or Go or something; you train two neural networks and then you make them play against each other, and that generates a new dataset of, you know, as many games as you want, billions of games, and they learn from those. So I don't know about the StarCraft one, for example. Yeah, and I'd say they have basically the same setup; usually you don't have more than one of these systems internally making the decisions. What you have with the existing successful ones is similar to what Sanjay suggested: the problem is the dearth of valuable and fast enough opponents to actually train the system, and what most of the successful systems ended up with is a setup where they use genetic algorithms of some sort to change the existing system in some way from its current state, creating a set of new opponents, basically, and run them against each other, then pick the best ones and go forward from there. That was for the last thing, this AlphaStar stuff. For AlphaGo it was a little bit different, in that it simply played the same algorithm against itself multiple times and had the copies learn from each other, one after the other, and it simply got better over time. The main benefit is simply that you don't have a human involved anymore. And with the last incarnation of their algorithm, which managed to play both Go and chess, and also some other games, with the same algorithm really well, they didn't even have to do anything specific to the games.
They just had to implement the rules and say: here you go. But this only worked because they could run hundreds of thousands, or millions, of games against each other. That's where the multiplicity comes in, nothing else, as far as I know. And there is a general problem: a human being, like a kid, I'm guessing, will look at an apple once and know it's an apple, and then they'll remember that. For neural networks, you have to give them tens of thousands of images; for these games, you have to somehow play millions of games. A human being will play an Atari game; a kid will learn it in probably five minutes. Yeah, but at the same time, what Sanjay mentioned is the learning part; what became apparent in the last couple of years is that it's just as important to forget something. If you learned something which was bad, you have to somehow, in the future, forget about it, because it doesn't immediately go away. So these kinds of things... but yeah, I would say that's an open problem right now: why do humans learn so fast, and why are these methods so simplistic? So one thing that's missing in everything here: there's no notion of state. When you feed in one example, and then you feed in another example, the network doesn't know what was fed to it ten minutes ago. And so recurrent neural networks have some measure of state, forget gates or remember gates, and then they can basically reason and say: I saw this ten minutes ago and this five minutes ago, so I should do this now. But a lot of it sounds fancier than it is; when you look at the guts, they're very intuitive ideas, and there's a lot of trial and error, you see what works. And a lot of what's behind these spectacular successes is just the efficiency of the implementation, what they can actually run efficiently on the hardware they have, and then just letting it brute-force run for thousands of hours. All right. Thank you everyone.