So, what stochastic depth does is sometimes skip adding the residual, the f(x), to the output, and only pass the identity through. Let me explain clearly. As you can see here, the vertical lines are layers, and the number above each vertical line is the probability that the layer adds its residual to the output. What stochastic depth says is: given that probability, I draw a uniform random number, and if my number is less than the probability, I add the residual to the output of the layer; otherwise I skip it.

So, how does this help? One direct help is regularization. If you keep training your net continuously over the same images, it starts overfitting. Just like every other method that drops a parameter or a unit, dropping a layer also helps with regularization. And what else? It helps with speed as well. Why? When you're picking and choosing which layers to run, you skip running the conv branch some of the time; you just send the identity from the previous layer through and never compute the residual. In terms of computation, if you assign survival probabilities that decay with depth, you end up running the conv branches only about 70% of the time.

One direct outcome of this is accuracy. You can see that up to about 400 epochs, residual networks with and without stochastic depth perform the same. After 400 epochs, the regularization advantage comes in handy, and the ResNets with stochastic depth give greater accuracy. And what else? As I said, a speed-up. Here I'm plotting test error on the y-axis and survival probability, those numbers above the vertical lines, on the x-axis. By choosing what test error you want and what survival probability you want for the layers, you can see that you can get a speed-up of up to 2x.

Why does the speed-up matter if we can already train this in a week? Let's say tomorrow you want to run a ResNet that is 1,200 layers deep. At 1,200 layers, this is the difference between months and a couple of weeks: training a 1,200-layer ResNet might take you four weeks, and stochastic depth can do it in just two.

Then there's a boring graph on vanishing gradients, if you actually care about that. And there are other methods I want to leave you with. Because I have a couple of minutes, I'll try to explain what they are, though I'm not doing them justice. In the recent past, a method called swapout was published. Swapout is also a regularization method, which combines residual connections, feedforward connections, and dropout. Then there are fractal nets, hyped as ultra-deep neural nets, which say that we don't need residuals: by stacking these fractal blocks together, you still get good outputs. And the third one says that instead of adding only the previous layer's residual, you can add an ensemble of residuals from several previous layers to the current layer and produce the output.

Last one, yes. I've done an implementation of the stochastic depth method and residual learning; it's on my GitHub page. That's it, okay? Thank you.
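To make the skip rule concrete, here is a minimal sketch of a stochastic depth residual block in PyTorch. This is not the implementation from my GitHub page; the conv branch, channel count, and the fixed survival probability of 0.7 are placeholder assumptions, and the usual linear decay of survival probability with depth is only noted in a comment.

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block that randomly skips its residual branch during training."""

    def __init__(self, channels, survival_prob=0.7):
        super().__init__()
        # survival_prob is the number drawn above each layer in the figure;
        # in practice it is often decayed linearly with depth.
        self.survival_prob = survival_prob
        self.residual = nn.Sequential(  # placeholder conv branch
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        if self.training:
            # Draw a uniform random number; keep the residual branch only if
            # it falls below the survival probability. Otherwise pass the
            # identity through and never run the conv branch, which is where
            # the training-time speed-up comes from.
            if torch.rand(1).item() < self.survival_prob:
                return torch.relu(x + self.residual(x))
            return x
        # At test time every branch is used, scaled by its survival probability.
        return torch.relu(x + self.survival_prob * self.residual(x))
```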
Oh, yeah. So, what is the problem you have used this residual for? Can you speak loudly? What was the problem you were trying to solve with this residual neural network? Was that the ImageNet problem, or was it something else?

So, primarily it was the vanishing gradient problem. The simple motivation is: from 8 layers to 22 layers, I could go easily without facing any problems. But when I try to do the same thing past 20, going to 50 layers, my accuracy goes down instead of up. That was traced back to the vanishing gradient problem, and to solve it, they came up with the residual learning idea. Oh, it was ImageNet, ImageNet and CIFAR.

Hi, this is Dilip. I had two questions. One, why should adding the residual help? And two, in one of the graphs that you showed, maybe around 200 epochs, there was a sudden drop. Why was there a sudden drop?

So, the sudden drop is because of a change in the learning rate. In the usual setup, you write a callback that changes your learning rate after a certain number of epochs; that's what causes the sudden drop. And the intuition behind residual nets: at the risk of sounding stupid, deep learning is still a very black-box method. But if you ask my intuition, it's like adding the state of a previous layer to give context to the current input. This also helps with the gradients in backpropagation, because you need a context of the whole thing; you're optimizing the whole function.

Can you speak up, please? You're asking how close the LSTM analogy is? Okay, it's widely discussed on Reddit, and if Reddit is not a very credible source of discussions, there is a publication by MIT PhD students that draws a close analogy between residual nets and RNNs. I could give you the paper offline as well, if you'd like. So, yeah, thank you.
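On the sudden drop in the curve: here is a minimal sketch of the kind of learning-rate step schedule being described, again in PyTorch. The model, the milestone epochs, and the 10x factor are illustrative assumptions, not the exact values from the experiments shown.

```python
import torch

# Placeholder model and optimizer, just to show the schedule.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Cut the learning rate by 10x at fixed epochs; each cut shows up as a
# sudden drop in the training/test error curves.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[200, 300], gamma=0.1
)

for epoch in range(400):
    # ... one full training pass over the data would go here ...
    scheduler.step()
```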