So that actually looked pretty linear, and it looked like the change in the weights is very small. In fact, only a finite overall weight change is often needed to get all training samples right, and in many cases, as the network gets very large, that change becomes arbitrarily small.

Let's think carefully about optimization. What we have is that the weights at the next time step are the current weights minus a small constant times the gradient of the loss with respect to the weights; that is what we do when we minimize with stochastic gradient descent. Now let's see what we can understand about that, and here's the intuition. It's a really deep topic, and we can't go through all the details, but we can give you somewhat of an intuition.

If learning is linear, then the order of learning doesn't really matter: I could first learn from data point x_i and then from data point x_j, or the other way around. The order doesn't matter because we are making very, very small changes. In a way, we can then understand learning in terms of influences: having learned x_i affects how we will process x_j, and vice versa. So each data point can be seen as exerting an influence on the processing of every other data point. Have you seen something like that before in machine learning? Yes, you have. This is exactly the way we think about support vector machines, and we will see that there's a deep link between them.

So we now have the neural tangent kernel, which is a function from input space times input space into the real numbers. When should data points interact? Well, when their gradients are not orthogonal to one another. So we define the kernel K(x, x') as the inner product of the gradient of f at x with respect to theta and the gradient of f at x'. With that, we have a well-defined influence between any two data points, which allows us to reformulate the learning.

This gives us many new computational opportunities. It allows us to work with infinite-width neural networks: if the width is infinite, learning will only introduce arbitrarily small weight changes. We can then use the kernel trick to formulate how that learning will unfold, we can understand the dynamics of learning and interference as the width of the network grows, and we can prove a wide range of statements about learning.

Now what is the upshot of that? For a lot of people and applications, this doesn't seem all that relevant. But it gives us an intuition about the circumstances under which data points will interact with one another and when they won't. Namely, if I have two data points whose gradients are correlated, then learning one will affect the other, and vice versa. At the same time, if two data points have gradients that are orthogonal to one another, then nothing happens between them: they really can be learned at the same time with minimal interference. And that tells us something about how we should think about generalization, and it tells us something about design. You could say: I want to design my neural network such that certain data points share some learning with one another, while certain other data points are processed independently, so that learning one doesn't interfere with the other. So I think that neural tangent kernels really give us an interesting way of thinking about the dynamics of learning.
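The influence argument above can be written out in one line of calculus. The following is a sketch under assumptions the lecture leaves implicit: a scalar-output network f_theta and a differentiable per-example loss ell.

```latex
% One SGD step on example x_i:
\theta_{t+1} = \theta_t - \eta\,\nabla_\theta \ell\bigl(f_{\theta_t}(x_i)\bigr)
             = \theta_t - \eta\,\ell'\bigl(f_{\theta_t}(x_i)\bigr)\,\nabla_\theta f_{\theta_t}(x_i)

% First-order effect of that step on the prediction at another point x_j:
f_{\theta_{t+1}}(x_j) - f_{\theta_t}(x_j)
  \approx \nabla_\theta f_{\theta_t}(x_j)^{\top}\,(\theta_{t+1} - \theta_t)
  = -\eta\,\ell'\bigl(f_{\theta_t}(x_i)\bigr)\,K(x_j, x_i),
\quad \text{where } K(x, x') = \nabla_\theta f_\theta(x)^{\top}\,\nabla_\theta f_\theta(x').
```

So the step on x_i moves the prediction at x_j exactly in proportion to the kernel value: correlated gradients mean interaction, orthogonal gradients mean none.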
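To make this concrete, here is a minimal sketch, not from the lecture, that computes the empirical neural tangent kernel of a tiny one-hidden-layer network in JAX and checks the first-order influence prediction numerically. The network, its width, the inputs, and all names here are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def init_params(key, width=512, in_dim=2):
    k1, k2 = jax.random.split(key)
    # NTK-style scaling: weights shrink as the fan-in grows
    w1 = jax.random.normal(k1, (width, in_dim)) / jnp.sqrt(in_dim)
    w2 = jax.random.normal(k2, (width,)) / jnp.sqrt(width)
    return (w1, w2)

def f(params, x):
    w1, w2 = params
    return w2 @ jnp.tanh(w1 @ x)  # scalar-output network f_theta(x)

def flat_grad(params, x):
    # gradient of f at x with respect to all parameters, flattened to one vector
    g = jax.grad(f)(params, x)
    return jnp.concatenate([p.ravel() for p in jax.tree_util.tree_leaves(g)])

def ntk(params, x, xp):
    # K(x, x') = <grad_theta f(x), grad_theta f(x')>
    return flat_grad(params, x) @ flat_grad(params, xp)

key = jax.random.PRNGKey(0)
params = init_params(key)
x_i, x_j = jnp.array([1.0, 0.0]), jnp.array([0.0, 1.0])

# One small SGD step on x_i under squared loss with target y_i
eta, y_i = 1e-2, 1.0
loss = lambda p: 0.5 * (f(p, x_i) - y_i) ** 2
grads = jax.grad(loss)(params)
new_params = jax.tree_util.tree_map(lambda p, g: p - eta * g, params, grads)

# The kernel predicts, to first order, how the step on x_i moves f(x_j)
predicted = -eta * (f(params, x_i) - y_i) * ntk(params, x_i, x_j)
actual = f(new_params, x_j) - f(params, x_j)
print(predicted, actual)  # nearly identical for small eta
```

The diagonal entry ntk(params, x, x) measures a point's influence on itself; where the off-diagonal kernel value is near zero, a gradient step on one point barely moves the prediction at the other, which is the "minimal interference" case from the lecture.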