which to a non-neural-network person kind of sounds like paint, but it's not; it's a new paradigm in training neural nets. And Anuj is going to tell us all about it. Thank you. Thank you so much. Today we're going to look at something called synthetic gradients. So what we're going to look at, largely, is a new way of training neural nets, as compared to the known paradigm of back propagation, which is at the very heart of training any neural net. I want to make this clear up front: this is not work that I have done. This is work that came from the DeepMind team last year. I was basically surfing around on Reddit when I came across a thread that had some interesting discussion around this work. That caught me a little intrigued, and I tried to read it up. It took some time, I'll be honest, to even understand what they were doing and why they were doing it. And then I realized that while many of us are struggling with the basics of deep learning, trying to get some neural net going for us, there are a bunch of people working on the completely other end of the spectrum, trying to figure out how you train massive neural networks. That is what this work is all about. So, how do neural nets learn? Ultimately, we pose learning as an optimization problem, and at the heart of training any neural net is what is called back propagation. The idea being: you have a loss function, you work out the gradient, that is, how the loss is going to change when you change one of the variables by a small amount, and then you use that as a signal to update your weights. That gradient is also called the error gradient. This was proposed way back; it is difficult to say who the original author was, because a lot of work came out around the same time. Anybody interested in the history can look at this link. So, let's do a quick refresher on back propagation, in the interest of everyone. What I have is a simple network: one node does an addition, and then, along with another input node, a multiplication. So the function this small network computes is (x + y) times z. Imagine we are starting with the values of x, y and z given below. What are we interested in? We are interested in how the function changes with respect to x, y and z: by how much it will change if I tweak x, y or z by a small amount. So, let's compute it from the basics. We know the partial derivative of f with respect to f is 1, which is easy. What is the partial derivative of f with respect to z, given the function f as defined? It is simply q = x + y, which for this set of values is 3. Now comes the partial derivative of f with respect to q. Again pretty easy, because q is x + y, so the derivative is z, which is minus 4. From here the error signal, which started at one end of the network, back-propagates towards the origin, and we start to use the chain rule. What does it say? It basically says that f is a composition involving q, so the derivative of f with respect to y is the product of those pieces, which in this case is again z, minus 4, and the story is the same for x. So, this is what is called back propagation.
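To make the refresher concrete, here is a minimal sketch in Python of the forward and backward pass for f = (x + y) · z. The specific values x = -2, y = 5, z = -4 are an assumption on my part, chosen to be consistent with the partial derivatives quoted above (q = 3 and z = -4):

```python
# Minimal sketch of the worked example: f(x, y, z) = (x + y) * z.
# The values below are assumed, chosen to match q = 3 and z = -4 from the slide.
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y            # q = 3
f = q * z            # f = -12

# Backward pass (chain rule)
df_df = 1.0          # df/df = 1
df_dz = q            # df/dz = q = 3
df_dq = z            # df/dq = z = -4
df_dx = df_dq * 1.0  # dq/dx = 1, so df/dx = z = -4
df_dy = df_dq * 1.0  # dq/dy = 1, so df/dy = z = -4

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```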
So, the error starts at one end of the network and back-propagates to the origin of the entire network. Now, in a nutshell, what is happening is this, where these boxes represent nothing but the layers of the network: f_i is the i-th layer. We have a forward flow where, starting from the input, you do a forward pass on your network; each layer makes its computation, as per the definition of f, and passes it on to the next layer, which uses that as its input and does its own computation. This is shown here by the green arrows: starting from f_i it goes to f_i+1, then to f_i+2, and so on. On the other hand, once this flow reaches the end, that is when you have the output, which you compare against the truth, and that is where the error starts. The error is the gap between what was predicted and what the truth is, and then you have back propagation, where the error signal goes back all the way down. Layer i takes the gradient from the layer above it, does its own computation, multiplying in its local gradient, and passes it on to the layer below. So we have the backward flow, shown in black here, and the forward flow, shown in green here. Now, this is how every neural net out there is trained. Can you think of a problem with this? I mean, this is something we have all been doing; anybody who has trained any neural net, at the heart of it, this is what we do. Is there a problem with it? Yes, and here is the thing: what it introduces is locking between the layers. What do I mean by that? If I model the entire network as a directed graph from layer 1 to layer 2 to layer 3, with l_1 to l_n being the n layers, then first there is forward locking. What is forward locking? A layer cannot compute its forward signal until it gets the signal from the layer before it. So layer l_i needs to wait for the signal to come from l_i-1, which in turn needs to wait for l_i-2, and so on. So this layer, in a way, is locked by all the layers before it. This is what we will call forward locking. The story is similar for backward locking. If you remember what was happening, we compute at the last layer, get the gradients, and percolate those gradients back layer by layer. Which means that in order to know how much error is attributed to a given layer, I need to wait for the error signal to come from the next layer, l_i+1, which in turn needs to wait for l_i+2, all the way up to l_n. This is what we will call backward locking. Both of them together give update locking: each layer has its weights, its parameters, and I want to update these parameters, but I can't until I have the gradient, the error signal. So what happens? I cannot update until all the subsequent layers have done their part and passed the signal on to me. All of this, the forward locking, the backward locking and the update locking, forces the layers of the network to be trained in a sequential manner. You cannot go beyond that. Now, all of us have been doing deep learning for some time; it all works fine. What's the big deal?
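As a rough illustration of the three kinds of locking, here is a minimal sketch of an ordinary training step; the Layer and loss objects are hypothetical, not from the paper, and the point is only that every call has to wait for the one before it:

```python
# Minimal sketch of why standard training is forward-, backward- and
# update-locked. `layers` and `loss_fn` are hypothetical objects.
def train_step(layers, x, target, loss_fn):
    # Forward locking: layer i cannot run until layer i-1 has produced its output.
    activations = [x]
    for layer in layers:
        activations.append(layer.forward(activations[-1]))

    # Backward and update locking: layer i cannot compute its gradient or
    # update its weights until every layer after it has back-propagated.
    grad = loss_fn.grad(activations[-1], target)
    for layer in reversed(layers):
        grad = layer.backward(grad)  # returns the gradient w.r.t. the layer's input
        layer.update()               # the weight update is only possible here
```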
I mean, the kind of networks most of us train are relatively small. But there are some efforts where the networks being trained are massive; look at something like the AutoML project from Google, which trains networks with a couple of hundred layers. Now imagine such a network being trained where the layers are locked and the entire network needs to be trained in a sequential manner, layer by layer: you have a forward pass and then you have a backward pass. The question they asked is: this is all fine for small networks, but with massive networks it creates trouble, so can we do away with it? Can we learn each layer in an asynchronous manner? Can I break this locking? That is essentially what this paper does; this is what they call decoupling the layers. Can I break the locking mechanism that is inherent to the back propagation algorithm? How does it help? As I said, with massive networks you could train each layer independently, which means that if you have a cluster of machines, rather than having these layers locked with this dependency in place, you can train each of them independently. But how do you do that? First we will try to do away with what I called the backward and update locking. What is very clear to us is that if I want to compute the gradients of the i-th layer, I need the gradients coming from the (i+1)-th layer; without that I can't do anything. But because of that very fact, I need to wait for the whole forward pass to go from this layer to the last, compute the error, come back, and finally give me my error, and I am not okay with that. If you think carefully, these two points appear contradictory: I have to get the signal from the next layer, but at the same time I don't want to wait for it. So they came up with an idea: it is possible to do away with the wait if I am willing to accept some error in my computation, and the idea is as follows. For notation, I am going to consider delta_i to be the actual gradient, that is, whatever gradient I would have got for layer i if I had done back propagation the normal way. What they do is: let's not compute this delta_i, let's instead estimate the gradient, whatever it is. How do I do this in the first place? We will see that later. Let us first see, if we had such a facility available, whereby you have an approximation of the actual gradient, how you would even use it. So assume for now that there are some magical boxes available to every layer: if the layer gives the box some input, it will return an approximation of the gradient. I am going to assume, for some magical reason, that I have these oracles available to me, and I am going to use them. Every layer l_i has access to a magical box which, when queried with some input, gives an approximation of the actual gradient. This estimate is what they call the synthetic gradient. So what is happening here? I have a layer. The layer does a forward computation and produces an output h_i; this is sent to the next layer, and at that very instant h_i is also sent to the oracle. Each layer has an independent oracle, and it returns the estimate.
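Here is a minimal sketch of what such a "magical box" could look like: a small linear model that maps a layer's output h_i to an estimate of the loss gradient with respect to h_i. The class and its training rule are illustrative assumptions, not the paper's code; the talk only says a simple, shallow model is sufficient.

```python
import numpy as np

class SyntheticGradientModule:
    """Illustrative sketch of M_{i+1}: maps activations h_i to an estimate
    of dL/dh_i. An assumption for exposition, not the paper's implementation."""

    def __init__(self, in_dim, out_dim, lr=1e-3):
        # Zero initialisation keeps the very first synthetic gradients at zero,
        # so early updates are harmless while the module is still untrained.
        self.W = np.zeros((in_dim, out_dim))
        self.b = np.zeros(out_dim)
        self.lr = lr

    def predict(self, h):
        """Return a synthetic gradient for activation h, instantly."""
        return h @ self.W + self.b

    def update(self, h, target_grad):
        """Regress the prediction towards the (possibly approximated) true gradient."""
        err = self.predict(h) - target_grad      # gradient of an L2 regression loss
        self.W -= self.lr * np.outer(h, err)
        self.b -= self.lr * err
```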
Now, the moment this layer has access to this value, which is instant, I am going to treat it as my actual gradient and update my weights; I am not going to wait for anything at all. It has become instant: I do the forward computation, I send it to the magical box, it returns a value, I treat that as the actual gradient, and I just execute the computation I have to do. So each layer uses synthetic gradients to update its weights, barring the last layer. And why? Because the last layer produces the output, which is compared against the actual output, the truth; you get the error and you can instantly get the gradient for the last layer anyway. So it is only at the last layer of the network, where the output comes out, that the synthetic gradient is equal to the actual gradient; all the other layers work with synthetic gradients, because there is no actual gradient available to them. Now that we understand how we would use synthetic gradients if we had them, let us address the question of how to produce them. What are these magical boxes? Magic does not happen in real life, so let us try to build something. What we have is a layer which produces h_i, and a box which takes the output of this layer as input and returns a gradient. Now, what does this magical box want to achieve? It wants to be as good an approximation of the actual gradient as possible: I want my synthetic gradients to be as close to the actual gradients as they can be. And how can I do this? Use another machine learning model. So imagine there is a machine learning model powering this box: it takes as its input the output of this layer, it produces the synthetic gradient, and at all times it is trying to get that estimate as close to reality as possible. What they show is that this can be done with a simple, shallow network; you don't need very heavy machinery. But there is a problem. The problem in this entire line of thought is: I want M_{i+1}, the magical box, to correctly produce the synthetic gradient, and the way it will do that is by learning from the actual gradients. But the actual gradients come from the entire flow of the forward pass and then the backward pass, where I get the truth. If that had to be done anyway, why did we do all of this in the first place? How do you circumvent it? Am I clear as to what has happened until now? The problem is: if I have to wait for the true gradient, there is no point in all this mumbo-jumbo. They said you don't really have to do that; if you are willing to allow some more error, you can circumvent it. And how do you do that? Imagine the next layer. That layer also has a magical box, M_{i+2}. It did its forward computation, sent it to its magical box, and got back its own synthetic gradient. All I do is take that synthetic gradient of the next layer, back-propagate it through that layer, and whatever comes out, I treat as the actual gradient for my layer. So I have the synthetic gradient, which is the approximation; the true gradient; and the approximation of the true gradient, obtained by taking the synthetic gradient of the next layer and back-propagating it. These are the three key terms we need to remember. Am I clear with this part now?
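Putting the pieces together, here is a minimal sketch of one decoupled update for layer i, under the same illustrative assumptions as before: hypothetical Layer objects with forward(), backward() returning the gradient with respect to the layer's input, and update() applying the weight change, plus the SyntheticGradientModule sketched above.

```python
def decoupled_step(layer_i, sg_module_i, layer_next, sg_module_next, h_prev):
    # 1. Forward computation of this layer.
    h_i = layer_i.forward(h_prev)

    # 2. Query the magical box and update immediately, without waiting for the
    #    rest of the network: the estimate is treated as if it were the real gradient.
    synthetic_grad = sg_module_i.predict(h_i)
    layer_i.backward(synthetic_grad)   # gradient w.r.t. this layer's input
    layer_i.update()

    # 3. To train the magical box itself, approximate the true gradient of h_i by
    #    back-propagating the *next* layer's synthetic gradient through the next
    #    layer, and use that as the regression target.
    h_next = layer_next.forward(h_i)
    approx_true_grad = layer_next.backward(sg_module_next.predict(h_next))
    sg_module_i.update(h_i, approx_true_grad)
    return h_i
```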
So what has happened till now? There is a very nice image they put up. I have a layer which takes a forward signal and does a forward computation. That computation goes to two places: one, to the next layer; two, to the magical box. The magical box receives this signal and instantly returns the synthetic gradient; this gradient is used to update the weights, and we are done. But at all points in time we want this magical box to make better approximations, for which it needs to see the actual gradient, which I in turn approximate by taking the synthetic gradient of the next layer and back-propagating it. Now, if you think about it, only the last layer sees the true picture; all the other layers are working with guesses. And what you have done, if you think about it, is decouple these layers: each layer is now completely independent and can do its updates instantaneously; the only thing it depends on is its magical box, from which it gets instant gradient estimates. So what you have achieved is that you have broken the essential dependency we saw, where I needed to complete the full forward pass and then the full backward pass before I could do the updates; now all layers can update in one step, at the same time. Brilliant. Now, there is a question: you have made so many approximations; it is great as an idea, and it works, but what about the quality of the gradient? What they did is take networks with different numbers of layers and two of the most popular datasets from the vision world, where you have very well-established benchmarks, and they ran normal back propagation; this is the error for each of those networks. So this was a three-layer network which produced an error of about two, then a four-layer, five-layer, six-layer network; it is a well-known fact that if you keep increasing the depth you keep getting better. This is the case where you have fully connected layers at the end, and here it's all convolution. And this is DNI, the decoupled neural interfaces, the idea they built. If you look at what is happening, as I increase the number of layers, the gap in the error keeps increasing: here it is pretty close, then it increases slightly, then more, then more. For the other dataset the story is very similar. So all the approximations you made are actually making you pay in the quality of the training, which in a way is a setback, because you ultimately want to replace back propagation. So they came up with a nice idea. They said: what signals we send to this magical box is in our hands; can we send more signals and see if it does better? What they did is what they call DNI conditioned on labels. What we saw before is that the layer does the forward pass and whatever the output is goes to the magical box; what I do now is, along with that, also send the label of the data point. That is what is called DNI conditioned on the label: instead of giving a single signal as input, you now give two signals, one the output of the forward computation, and the second a signal for the class of the data point. What they were able to show is that the error now comes much more under control; as a matter of fact it does pretty well, if you look at all the numbers across both datasets. So that is the error quality: this technique, when the label is also given to the magical boxes, keeps the error in control. But what about the amount of time it takes to converge?
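Here is a minimal sketch of the "conditioned on labels" variant. As I understand it, the only change is that the synthetic-gradient model is fed the label, for example one-hot encoded, alongside the layer's output; the helper below is illustrative, not from the paper's code, and the module from the earlier sketch would simply be built with a larger input size.

```python
import numpy as np

def cdni_input(h, label, num_classes):
    """Concatenate a layer's output with a one-hot label for a cDNI module.
    Purely illustrative; the name is not from the paper."""
    one_hot = np.zeros(num_classes)
    one_hot[label] = 1.0
    return np.concatenate([h, one_hot])   # the module now regresses from [h_i, y]

# Usage sketch: a module whose input is [h_i, y] but which still predicts
# a gradient of the same size as h_i.
# sg_module = SyntheticGradientModule(in_dim=hidden_dim + num_classes,
#                                     out_dim=hidden_dim)
# grad_estimate = sg_module.predict(cdni_input(h_i, label, num_classes))
```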
Maybe one approach takes a couple of hours while the other takes a couple of days, which would not be good news. So that is the second aspect of quality, and they ran more experiments around it. Let me help you read this. The grays and the blacks are again networks of different sizes, different numbers of layers, and these curves show how fast each is able to converge to the optimum. If you look carefully, the blacks and the grays are sitting down here, I hope you are able to see that; this is where normal back propagation is sitting. Then the blues are DNI, the original idea: they try to converge, but even after a lot of iterations they only come down to here. So this is where normal back propagation sits and this is where the DNI idea sits. What about cDNI? If you are able to control the error, can it also take care of the speed at which the error converges? And this is what they found: the reds and the yellows; as a matter of fact, surprisingly, the yellow is even below the gray for most of the iterations. If you look at the yellow curve here, it is above the gray, but somewhere here it drops much more dramatically. So what have we gained out of all of this? They seem to be hinting that as you increase the number of layers, things surprisingly only get better, along with the speed you gain. So are we done? That was the whole idea: a bunch of people, while most of us, as I said, are struggling with the basics, trying to see whether massive neural networks can be trained in faster ways. But we are not quite done. We got rid of the backward and the update locking, but if you think carefully, our layers are still forward locked, which means no layer can do its forward computation until it gets the signal from the previous layer. That is the part that is left, and what they do is apply the same technique in the forward pass as well. This image might look a little daunting, but let me explain what is going on. This is a data point, and it has a label; these are my layers, and each layer has a magical box. The data comes to this layer, it does the forward computation, that goes to the magical box, which generates the gradient; this is then passed on, and so on and so forth. Now what do I do with the forward passes? I still need to break that locking. How do you do that? Have more magical boxes that can predict even the forward computation. Each of these modules takes the original input and tries to predict what the previous layer would have sent. So you have h_i, which is the true forward value, and the h_i hat, which is the approximation of it, and this box is trying to reduce the gap between them. They further show in the paper that this whole idea can be applied even to RNNs, where you have back propagation through time; things only become more complex, and I will not go into that because the figures and the math get really involved, but the very same idea can be applied and worked through. This is the original paper, and there was a follow-up paper from the same team. I tried to implement this idea and have put up a GitHub repo which anyone who is interested can look at. To facilitate a better understanding, what I have done is implement back propagation with objects, so each layer acts as an object; I show how you can implement normal back propagation that way, and then how you can implement the same idea with synthetic gradients.
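For completeness, here is a minimal sketch of how a layer could run with the forward lock broken as well: alongside the synthetic-gradient module, each layer has a model that predicts the forward signal it would have received, so it does not have to wait for the previous layer. This is an illustrative assumption about the mechanics, not the paper's code; here the prediction is conditioned on the network's raw input, and the module objects are of the kind sketched earlier.

```python
# Minimal sketch of a fully unlocked step for layer i: it neither waits for the
# real input from layer i-1 nor for a real gradient from above. `input_module`
# predicts h_{i-1} from the raw data x; `sg_module` predicts dL/dh_i.
def fully_unlocked_step(layer_i, input_module, sg_module, x):
    h_prev_hat = input_module.predict(x)      # synthetic input: no forward lock
    h_i = layer_i.forward(h_prev_hat)
    layer_i.backward(sg_module.predict(h_i))  # synthetic gradient: no backward lock
    layer_i.update()                          # no update lock
    return h_i

# Later, when the true h_{i-1} eventually arrives from layer i-1, the input
# module is trained to close the gap:  input_module.update(x, h_prev_true)
```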
The reason I wanted to put this talk across is that when I first read this, I thought it was a bizarre idea: for somebody to even think that something at the very heart of deep learning, which we have all taken for granted, has limitations, to show those limitations and try to work around them, and to use machine learning to solve a problem of machine learning, that was interesting to me. So that is the end of the talk from my end; any questions, I am happy to take them. Just a second, we will get you a microphone. Question: does that increase the computational resources required by three times, or two times rather? So, when I showed this comparison, everything else was kept the same; the only things that changed are the network configuration and the training algorithm. And what you surprisingly find is that beyond a certain point it actually starts to do much, much better. Question: when you are approximating the inputs as well, doesn't that increase the error? They show that it is not the case: using the same idea, where you use one approximation to produce a better one, they show that the error remains pretty much in control. Question: one quick question. When you talk about the time taken for the complete network to perform the modelling task, you talk about iterations. One iteration in one model can take one day, and one iteration in this model can take three days. So in terms of wall-clock time, in seconds, how can you say this system will perform better than regular back propagation? Because I see there are these modules, I1, I2, I3, which have to be modelled at the back end before you even start doing this task. Not really, right? You can start this up front, because the signal goes to all of them in parallel, in the same go. One thing is that initially those approximations will be bad; of course they will be. But the whole idea is that with time they see the actual values and refine themselves, and they also learn in parallel. So here you have one massive model, and each of these is a single neural net which learns independently; as she said, in this figure there are seven neural nets training in parallel to achieve this. The point is that you are leveraging the power of many lightweight models and training each of them in parallel. This becomes much clearer if you imagine training a ResNet with 150 layers: it takes a huge amount of computational resources. And if you look at the AutoML work they are going after, one of the articles seems to claim it has close to 500 layers. How do you train such a thing? It is a beast of its own. But the point is that each module can work independently; that is the crucial thing happening here. All it needs is its input signal, and then what the reality would have been for that signal; that is its whole world. It does not need to bother with what is happening in the rest of the network; its job is only to take this signal, see what the truth for it is, and build a better approximation so that it can predict better in the next iteration. Sorry, the question was whether there are any numbers on the actual time in seconds, whether it is comparable, 2x, half the time. I hope you understand it was not possible for me to train this kind of network; my pocket does not allow the kind of resources you need. But what they showed in the paper is that the increase they saw was at most about 2x in time per iteration. Question: so what is the gain then? We are losing on accuracy and we are spending 2x the time.
See, that is per iteration, only per iteration; the number of iterations you need to converge actually goes down. Question: only by 20% or so? Something like that, in the number of iterations; what converged at 500 iterations now converges at around 400. But they are currently in the process of trying this with 40 and 50 layers; this was only an initial work, where they tried with up to, I think, 8 or 10 layers and saw promising numbers coming out, so that is the bet they are placing. There is more work going on, of course; this cannot be the end. They are aiming for networks with a much larger number of layers. Question: and how promising is this for a higher number of layers, like 100 layers? Can we go back to the slide that shows the numbers? It seems to show that as the layers increase, not just the difference, but the actual error is going up. No, the error is coming down; this is 1.8. Of course plain DNI was worse, but they also came up with a fix: just by adding one additional signal, which is the label of the data point, which is cheap, it is already there with you, the error actually comes down and does better. Question: okay, but there are no published results for a higher number of layers? Not yet; those are the only two papers that have come out. Question: my question is, because you have so many neural nets acting in parallel, can you always ensure that you are moving towards a minimum? Because with all these approximations you could suddenly have an error which causes the cost function to climb up. I don't know, what are your thoughts on that? Two things. One, you could try to prove theoretically that it will always converge and do better; I am not sure there is any way of attempting that. They are depending on the age-old idea of machine learning: you run a lot of experiments, you start from different conditions, and you see whether there are aberrations. Until now they have not seen anything like that. Question: does it zigzag a lot, does it suddenly spike up and come down? That is more a question of, given an error surface, how well you can do on it, and there have already been advances well beyond plain gradient descent; there are far more sophisticated techniques, not only to avoid certain pitfalls but to get to the goal quickly. I mean, the loss functions of neural nets are not as nice as what we see in many other ML settings. Here, too, they have not found anything abnormal, I would say. It is a well-known fact that these are never the nice, beautiful world of convexity, but that is true for any neural net we train. Question: let's talk about really large-scale datasets, where you typically apply dropout and so on to optimize performance, with all the nuances that come with that. In that practical setting, where does this kind of method stand? See, one, I don't see how the size of the dataset is going to affect this technique; I don't think it has any bearing. The only concern is that because you are doing a lot of approximation, those approximations will compound if you have more layers: each one starts with a bad approximation of the next one, and so on, and it is only the last layer that has always seen the true picture. The initial results are promising.
For the larger networks, I would say it is still work in progress; we will only know much later. Question: what about dropout? It is well known that some nodes are randomly turned off, and so on. No, that has not been tried yet. And see, even around dropout there is only so much we know about why it works and how it works; large parts of it, because it came from Geoff Hinton, people simply follow. There are certain explanations, but there is still no clean, complete understanding. Have they tried it here? The answer is no. There is some intuition to be gained, but that is a different discussion. The most interesting part they found is that the same technique applied to RNNs, which use back propagation through time, also works beautifully. Question: one more, last question. This is typically for feed-forward networks; now, say you do data parallelism, where there is distributed optimization of your parameter updates and so on. If I compare that method with this one, where do you see this standing? The initial attempts have all been on fully connected networks, I would say. Settings where your network itself is dynamic, where depending on some condition the flow takes a different path, and all those things, that is a very long shot; there is not much around that yet. There is hardly anything on this; there are just a couple of threads where these discussions are available. For sure, there is a question over here. Question: can I make a comment? I have just one comment: in gradient descent the neurons get saturated over time and the gradients die out because of back propagation. I would just like your insight on whether that problem gets solved when we decouple the gradients altogether. Vanishing gradients are not specific to this. If you look at the entire world of RNNs, vanishing gradients are there, and why did LSTMs do better than RNNs? Gradient explosion is taken care of by clipping, which is a very standard technique; LSTMs did better because they made the gradient additive instead of multiplicative, by having a memory inside. That is how the addition happens: even if a lot of small numbers add up, they can still give you a significant number. In LSTMs and the various models coming in now, that does not have any role to play here, because the gradients are still, in a way, flowing. It is not that you have broken the hierarchical dependencies in the network; the hierarchy is still there. I am still dependent on the next layer for my synthetic gradient to come, but that is as far as the dependency goes; it does not propagate any further. So this should not have any effect on that, right? I mean, LSTMs have already solved that, right?
So if you are training RNNs with back propagation through time, it should continue to do fine; I don't see any reason why not. LSTMs are very much focused on recursive neural networks, sorry, recurrent neural networks, and if you look at LSTMs and the whole genre that has come since, with highway networks and the like, they have moved past the problem of vanishing gradients, the whole idea being that there is a memory where the gradients add up rather than being multiplied. If I add a lot of small numbers I can still get a significant number; I am never multiplying a lot of small numbers down into something even more insignificant. Okay, sure, thanks. We are just about out of time for questions, I'm so sorry; this is such an interesting and deeply technical talk that I am sure those of you who are into it probably have a million questions, so perhaps Anuj can take your questions offline later. We're going into another talk next. Thank you very much.