Uh, wait, what just happened? Well, I hope you all came back before despairing too much. What did we do? We made the network deep, and upon initialization the activations will go exponentially to zero or to infinity, and the same holds for the gradients.

Let's see why. Take a weight and replace it with alpha times that weight, and do that in each of the n layers. What happens? The output y is now alpha to the power of n times y. So if alpha is even a little bit off from one, the activations will go to zero or to infinity exponentially; we will hit the saturated boundaries of our nonlinearities, the gradients will vanish, and if they vanish, training stops. So meaningful initialization is often the key if we want to have success at training neural networks.

What do we want? We want activations around zero. Why? So that for each of the ReLUs we have some samples in the flat region to the left and some in the steep region to the right; or, alternatively, with saturating units we want to be in the roughly linear range, where the outputs are meaningful and the gradients are of the same order as the activations. So we want activations around zero with a standard deviation around one, and we want gradients around zero with a standard deviation, again, around one. Why? Because that's the scale at which our abstractions work well.

What we will now do is use Xavier initialization: we initialize the weights to be uniformly distributed in the range of plus or minus the square root of 6 / (n_in + n_out), where n_in and n_out are the numbers of incoming and outgoing units. Follow the derivation for that in the Colab notebook; it's really quite beautiful.
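The two points above, that a per-layer scale factor alpha compounds to alpha^n, and that the Xavier range keeps activation scales stable, can be sketched numerically. This is a minimal NumPy sketch, not the notebook's code; the width `n`, the depth, the tanh nonlinearity, and the "slightly off" factor 0.5 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 50  # assumed layer width and network depth, for illustration


def forward_std(init):
    """Push a random input through `depth` linear+tanh layers
    and return the standard deviation of the final activations."""
    x = rng.standard_normal(n)
    for _ in range(depth):
        W = init(n, n)
        x = np.tanh(W @ x)
    return x.std()


def naive(n_in, n_out):
    # Scale "a little bit off" (alpha ~ 0.5): activations shrink
    # roughly like alpha**depth, i.e. exponentially with depth.
    return 0.5 * rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)


def xavier(n_in, n_out):
    # Uniform in +- sqrt(6 / (n_in + n_out)), so each weight has
    # variance 1 / n and the activation scale is roughly preserved.
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))


print("naive :", forward_std(naive))   # collapses toward zero
print("xavier:", forward_std(xavier))  # stays within a few factors of one
```

Running this, the naive scheme's activations are vanishingly small after 50 layers, while the Xavier-initialized network keeps them at a usable scale; the same compounding argument applies to the gradients on the backward pass.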