I'm Alex Gessner from the University of Tübingen, and I will present our recent work on gradient inference with Gaussian processes, carried out with my collaborators Filip de Roos and Philipp Hennig. By the end of this video you will see how Gaussian process inference with gradient observations can be made efficient in high dimensions.

Let's consider a standard Gaussian process regression task. We want to learn a mapping from a d-dimensional input to a scalar output using n evaluations of the function f. Standard GP inference has the well-known scaling that is cubic in the number of evaluations n for computation and quadratic in n for memory. GPs can also be conditioned on gradient observations. These increase the computational load to cubic in both the number of evaluations and the dimension, and to quadratic in both d and n for storage. In other words, one gradient evaluation is equivalent to d function evaluations in terms of hardware cost, but it only provides information about the function at a single location. Because of this high cost, gradient observations are not very common in GPs, especially as the dimension grows.

In this work we show that gradient inference can scale linearly with the dimension, both in computation and in memory. The high-level takeaway is that one gradient observation can be cheaper than d function evaluations in high dimensions. Let's see how this works.

It turns out that the kernel Gram matrix for gradients has a lot of structure for many popular kernels, such as stationary or dot-product kernels. On the left you see such a clearly structured matrix for three gradient observations in 10 dimensions using the RBF kernel. The structure originates from applying the product rule of differentiation to the kernel. On the right is the same matrix decomposed into its constituents; the first code sketch at the end of this transcript illustrates this decomposition. The decomposition is beneficial to use together with the matrix inversion lemma. Instead of inverting a dn-by-dn matrix, as on the left, we now only need to invert a Kronecker product of an n-by-n matrix (that is K here) and a d-by-d matrix that is typically diagonal; for the RBF kernel, that is the lengthscale matrix here. In addition, we need to invert an n²-by-n² matrix. When applicable, this decomposition can significantly speed up the inversion; see the second sketch below. I would also like to point out that several of the matrices contain a lot of white space, that is, they are mostly zeros. We can use this sparsity to further speed up the Woodbury inversion.

The proposed decomposition is useful when the number of observations is smaller than the dimensionality. You can see this in the figure, where lines that lie below the horizontal black dashed line indicate that the decomposition is faster. For d much larger than n, it can lead to a significant speed-up. In the paper we also show how to avoid building the Gram matrix explicitly, and we apply these improvements to algorithms for optimization and sampling.

I hope you are now convinced that GP inference with gradient observations can be made efficient for high-dimensional inputs. Thank you for listening. If you are interested and want to learn more about the technical details and example applications, we invite you to check out our paper and the code repository linked below.
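
First sketch. As a minimal illustration of the structure described above: for the RBF kernel, each d-by-d block of the dn-by-dn gradient Gram matrix follows from differentiating the kernel twice, and the full matrix splits into a Kronecker product K ⊗ Λ⁻¹ plus a correction of rank at most n². The code is illustrative only (names such as grad_gram are ours, not from the paper's repository), assuming unit signal variance and a diagonal lengthscale matrix Λ:

```python
# Illustrative sketch (not the authors' code): structure of the gradient
# Gram matrix for the RBF kernel k(x, x') = exp(-0.5 (x-x')^T Lam^-1 (x-x')).
import numpy as np

def rbf(X, lam):
    """RBF kernel matrix with diagonal lengthscale matrix Lam = diag(lam)."""
    diff = X[:, None, :] - X[None, :, :]          # (n, n, d) pairwise differences
    return np.exp(-0.5 * np.einsum('ijd,d,ijd->ij', diff, 1.0 / lam, diff))

def grad_gram(X, lam):
    """dn x dn Gram matrix of d^2 k / dx dx'^T, built block by block."""
    n, d = X.shape
    K = rbf(X, lam)
    Lam_inv = np.diag(1.0 / lam)
    G = np.zeros((n * d, n * d))
    for i in range(n):
        for j in range(n):
            zij = (X[i] - X[j]) / lam             # Lam^-1 (x_i - x_j)
            # differentiating the kernel: a Kronecker part minus a rank-1 part
            G[i*d:(i+1)*d, j*d:(j+1)*d] = K[i, j] * (Lam_inv - np.outer(zij, zij))
    return G, K, Lam_inv

rng = np.random.default_rng(0)
n, d = 3, 10                                      # as in the figure: 3 points, 10 dims
X, lam = rng.standard_normal((n, d)), np.ones(d)
G, K, Lam_inv = grad_gram(X, lam)

# The residual after removing the Kronecker part K (x) Lam^-1 has rank <= n^2,
# which is what makes the matrix inversion lemma pay off for d >> n.
R = G - np.kron(K, Lam_inv)
print(np.linalg.matrix_rank(R), "<=", n * n)
```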
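
Second sketch. Here is a hedged illustration of how the matrix inversion lemma exploits this structure. It assumes an explicit low-rank factorization G = K ⊗ Λ⁻¹ + U C Uᵀ with U = I ⊗ Z and Z holding the columns Λ⁻¹xᵢ, which reproduces the blockwise rank-1 corrections above; this is our reconstruction for illustration, not necessarily the exact factorization used in the paper. Note that only K (n-by-n), the diagonal Λ (d-by-d), and one n²-by-n² system are ever solved, and that C is itself sparse (its diagonal blocks vanish), reflecting the sparsity mentioned in the talk:

```python
# Hedged sketch: solve G v = y via the Woodbury/push-through identity
#   (A + U C U^T)^-1 = A^-1 - A^-1 U (I + C U^T A^-1 U)^-1 C U^T A^-1,
# valid even for singular C, with A = K (x) Lam^-1, so A^-1 = K^-1 (x) Lam.
import numpy as np

def woodbury_solve(K, lam, U, C, y):
    n, d = K.shape[0], lam.shape[0]
    def A_inv(v):
        # apply K^-1 (x) Lam without forming the dn x dn Kronecker product
        V = v.reshape(n, d)                       # row i holds the i-th d-block of v
        return (np.linalg.solve(K, V) * lam).ravel()
    Ainv_y = A_inv(y)
    Ainv_U = np.column_stack([A_inv(u) for u in U.T])
    small = np.eye(C.shape[0]) + C @ (U.T @ Ainv_U)   # only n^2 x n^2
    return Ainv_y - Ainv_U @ np.linalg.solve(small, C @ (U.T @ Ainv_y))

# Reconstructed factors, continuing from the first sketch above:
Z = (X / lam).T                                   # columns z_i = Lam^-1 x_i
U = np.kron(np.eye(n), Z)                         # dn x n^2
C = np.zeros((n * n, n * n))                      # sparse: diagonal blocks are zero
E = np.eye(n)
for i in range(n):
    for j in range(n):
        e = E[i] - E[j]
        C[i*n:(i+1)*n, j*n:(j+1)*n] = -K[i, j] * np.outer(e, e)

y = rng.standard_normal(n * d)
x_fast = woodbury_solve(K, lam, U, C, y)
print(np.allclose(x_fast, np.linalg.solve(G, y)))  # True: matches the dense solve
```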