Can you see my slides? Yes, we do. OK, excellent. So first of all, I wanted to thank Jean, Florin, and Mateo for organizing this ecologically friendly workshop. It was really great, and I have a feeling I learned a lot during this week. Today, I wanted to share with you some of our most recent results on learning of discrete graphical models. My focus will be on how we can use neural networks as universal function approximators to drastically boost the computational efficiency of the corresponding methods.

Let me start with a quick reminder on graphical models, which are useful tools for describing probability distributions that have a certain conditional dependency structure according to a given graph. Graphical models have two important properties. The first one is the factorization property, which translates into the fact that a positive probability distribution can be written as a Gibbs distribution with an energy function that is a sum of local terms factorized according to the cliques of the graph. The second property, which is important for structure learning, for instance, is the so-called separation property, which states that a variable conditioned on its neighbors is independent of the other variables in the graph.

Informally, we can pose the graphical model learning problem as follows: we observe n independent samples from a distribution mu(sigma), and we want to learn its structure and parameters. Other important dimensions of the problem are the number of variables p and, since this talk is about discrete graphical models, the alphabet size q, meaning each variable can take one of q discrete values.

I put here some notable prior work on computationally efficient learning of discrete graphical models. It starts with a seminal paper by Guy Bresler on mutual information-based greedy methods, which were later generalized by Hamilton and others to more general discrete graphical models. There is also a class of methods based on convex optimization, in particular regularized pseudo-likelihood and the interaction screening method.

The typical setting of graphical model learning is as follows. Usually, we are talking about parametric estimation, where we assume that we know an exponential family for the distribution we are interested in. In particular, we know a set of basis functions, denoted here as g_k, which act on subsets of spins in our system. What we don't know are the parameters associated with this model, denoted here by theta star, and that is what we want to estimate. So we observe independent draws from this distribution, and we want to estimate the parameters up to a certain error epsilon. Typically, we want to work with reasonable models, which means we assume that the L1 norm of the parameters associated with each of the nodes is bounded, and that we know this bound a priori.

So how does one solve this inverse problem and learn this type of distribution? One method is called generalized RISE, or GRISE, where RISE stands for Regularized Interaction Screening Estimator. The expression is given here: essentially, this estimator minimizes a loss given by the empirical average of the exponential of the local energy terms taken with the opposite sign. This problem needs to be solved once per neighborhood of your graph or hypergraph.
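To make the estimator concrete, here is a minimal sketch in Python of the screening objective for a single node, assuming the basis functions involving sigma_i have already been evaluated on the samples and stacked into a feature matrix; the function and variable names are mine for illustration, not from the paper:

```python
import numpy as np

def grise_loss(theta, features, lam=0.0):
    """Interaction screening loss for one node of the graph.

    features : (n, K) array; column k holds the basis function g_k,
               restricted to the terms containing sigma_i, evaluated
               on each of the n samples.
    theta    : (K,) parameter vector for this neighborhood.
    lam      : strength of the l1 penalty (the "regularized" part).
    """
    local_energy = features @ theta          # local energy per sample
    loss = np.mean(np.exp(-local_energy))    # screening term
    return loss + lam * np.abs(theta).sum()
```

Since the exponential of a linear function of theta is convex and the l1 penalty is convex, the whole objective is convex in theta.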
And it is a convex function, which can be efficiently minimized using entropic descent type methods. A quick intuition for why this estimator works can be obtained by looking at the infinite sample size limit, where the empirical average over samples becomes the expectation with respect to the underlying measure. With one line of calculation, one can show that the minimum of this convex objective function is achieved at the point where the local parameters correspond to the true parameters of the model. This is the point where the interactions are essentially screened, which explains the name of the method.

Of course, what is most interesting is not the infinite sample size limit but the finite sample analysis: can we reliably learn a graphical model using a finite number of samples? A complete statement of our results with the proofs can be found in our arXiv preprint, but for the purpose of what follows in the talk, let me just state the result informally. What we show is that GRISE learns an arbitrary graphical model with high probability, up to an error epsilon, with a number of samples proportional to the logarithm of the dimension of the problem, so the logarithm of the number of nodes, times a factor which scales as q, the alphabet size (how many values each variable can take), to the power L, where L is the maximum clique size in our factor graph. And this can be done with a computational complexity which is p to the L, so the dimension of the problem to the power of the maximum interaction order that we have in our model.

So far I gave a very general statement, so let me give a concrete example. For the sake of simplicity, let us work with binary variables, which take values minus 1 and plus 1. Typically, if we don't know what the distribution is, one approach would be to fix a complete basis function hierarchy, which can be, for example, the monomial basis functions corresponding to different interaction orders, which is very appealing in physics. We could then try to model our distribution by progressively adding higher and higher orders: we can start with the first and second orders, which corresponds to the famous Ising model, then add three-body interactions, four-body interactions, and so on, up to a certain order where we are happy with our choice. The interaction screening loss for this case takes the following form: to estimate the neighborhood of the variable i, I need to solve a convex optimization problem where the loss function is given by the exponential of minus all the terms from the energy function which contain the variable sigma_i, which we refer to as the local energy. And as stated by the theorem that I announced before, for models with up to L-body interactions, the computational complexity scales as p to the L, up to logarithmic factors; a sketch of where this count comes from is given below.

So this is a very reasonable approach, and it is very appealing in physics, for example, where we might know that the interaction order decays. However, if we want to go to higher orders and we have a pretty large system, then we see that this computational complexity starts to scale pretty badly. And the problem is really that in many applications we don't know the right basis function hierarchy in which the representation of the energy function is sparse.
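As an illustration of where that scaling comes from, here is a short sketch, in the same hypothetical NumPy setup as above, that builds the feature matrix of all monomials containing sigma_i up to order L; the output can be fed directly into the grise_loss function from the previous sketch:

```python
import numpy as np
from itertools import combinations

def local_monomial_features(samples, i, L):
    """Monomial basis restricted to terms containing sigma_i, up to order L.

    For +/-1 spins, every local energy term is sigma_i times a monomial
    in the other variables; there are sum_k C(p-1, k) of them for
    k = 0..L-1, i.e. O(p^(L-1)) per node, hence the overall p^L cost.
    """
    n, p = samples.shape
    rest = [j for j in range(p) if j != i]
    columns = []
    for order in range(L):                      # monomials of size 0..L-1
        for subset in combinations(rest, order):
            monomial = samples[:, i] * samples[:, list(subset)].prod(axis=1)
            columns.append(monomial)
    return np.stack(columns, axis=1)            # shape (n, K)
```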
So the natural question is: is there a better way to model this basis function hierarchy? This is where the proposal of using a neural net parameterization comes into play. The suggestion, if you compare the two loss functions written here, is to use the nonlinear representation power of neural networks and their ability to approximate functions, and to replace this local energy function, modulo the central variable sigma_i, by a neural network representation. One can show that if the neural net is expressive enough, meaning that as a function approximator it contains the local energy function we want to reconstruct in its hypothesis space, then the global minimum of this neural net-based GRISE estimator will correspond to the recovered local energy. So as long as we can find the global minimum of this interaction screening loss based on neural networks, it is a legitimate estimator, and we will reconstruct the right model. The entire point is that neural networks will, in some sense, explore a different basis function hierarchy (I give a short sketch of this loss below).

Let us illustrate this on a small example. Here we take a small model on 10 variables with interaction order equal to 6, and we compare the two estimators: one is based on the neural net parameterization, and the other is the GRISE estimator with interaction order up to 6, which has the benefit of containing the energy function exactly in its hypothesis space. What we see is that the neural network-based GRISE does a pretty good job at reconstructing the true model. What we've done here is expand the learned output of the neural network in the monomial basis, which is a very expensive operation but can be done for a small model like this. And we see that the neural network indeed explores a different basis function hierarchy: in particular, it initially contains high-order interactions even beyond order 6, and then progressively learns to set them to 0. As a result, it learns the local energy function pretty accurately, and with a small number of parameters.

In fact, this advantage of using a different basis function hierarchy is more apparent when we move to slightly larger problems. By simply increasing the previous problem size from 10 to 15 variables, we quickly realize that the fully general model on binary variables with up to 6th order interactions contains around 3,500 terms per neighborhood in the monomial basis. So we still have to solve a convex optimization problem, but one involving a large number of variables, and GRISE quickly becomes intractable: you cannot run it on your laptop. If we compare the performance of the learned conditional distributions from both models, which we can do by sampling from both and comparing them in total variation distance, we see that the neural network basis representation essentially achieves the best possible error, which is due to the finite sample size. And it is doing this with only 350 parameters, and using fewer samples. For this problem, what was feasible for us was computing the GRISE model up to order 4, and we see that this obviously results in the worst conditional distribution.
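Here is a minimal PyTorch sketch of this neural net-based screening loss for the binary case. The architecture and names are my own assumptions for illustration, and it uses the fact that for plus/minus 1 spins every local energy term containing sigma_i is sigma_i times a function of the remaining spins:

```python
import torch
import torch.nn as nn

class LocalEnergyNet(nn.Module):
    """Small MLP standing in for the local energy of node i, modulo sigma_i."""
    def __init__(self, p, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(p - 1, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, sigma_rest):
        return self.net(sigma_rest).squeeze(-1)

def neural_ise_loss(model, samples, i):
    """Screening loss with the network replacing the explicit basis expansion."""
    sigma_i = samples[:, i]
    sigma_rest = torch.cat([samples[:, :i], samples[:, i + 1:]], dim=1)
    # local energy = sigma_i * (function of the other spins), approximated
    # by the network; same exponential screening form as before
    return torch.exp(-sigma_i * model(sigma_rest)).mean()

# hypothetical usage: one network per node, minimized with a standard optimizer
# model = LocalEnergyNet(p)
# opt = torch.optim.Adam(model.parameters())
# loss = neural_ise_loss(model, samples, i); loss.backward(); opt.step()
```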
So the takeaway here is that using a neural network as a universal function approximator, which can figure out a parsimonious basis on its own, can be very valuable: it boosts the computational efficiency and also the sample efficiency of the algorithm. One interesting and important thing to note is that by approximating the energy function with this implicit neural network representation, we do not lose the attractive features associated with the original estimator. One illustration is the application to the structure learning problem: can we recover the structure of the graph? The inspiration comes from the separation property of graphical models that I mentioned in the beginning. If I am trying to estimate the neighborhood of my node u, then the variables v which lie outside of this neighborhood should not influence the output at the interaction screening minimum. We can exploit this fact by imposing a regularization on the weights of the first layer of the neural network, which forces the algorithm to set to 0 the weights attached to variables that are not in the neighborhood (see the short sketch at the end). And this is the illustration of this procedure on a graph, where you see that for the edges that are not present in the graph, the corresponding weights are progressively set to 0 during the learning phase, while the weights for the edges that are present remain in the representation. So we can do structure learning with the neural network representation of the basis functions.

To summarize, I've talked about two estimators. The first one is GRISE, a convex estimator for learning general discrete graphical models in an arbitrary parameterization, which comes with rigorous guarantees. In fact, if you look closely at the applications to previously known sub-cases, such as Ising models, pairwise Potts models, or binary models with multi-body interactions, in all of these cases the sample complexity of GRISE improves upon previously known methods. I've also talked about the neural net-based generalization of the GRISE estimator, which provides a computational boost to the learning problem. Although it becomes a non-convex estimator, it uses the nonlinear representation power of neural nets to explore parsimonious basis hierarchies. And the good thing is that neural network GRISE still maintains the attractive features of the original GRISE estimator. In particular, it can learn the structure of the MRF; with some suitable modifications, it can parameterize and learn the full energy function of the model; and it produces conditional distributions that can later be used for sampling from the learned model. I have presented several examples of this in the numerical section. You can read more about this in our arXiv preprint. And with this, I'll be happy to answer any questions you might have during the Q&A session. Thank you.
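To make the structure learning regularization concrete, here is a minimal sketch of a group penalty on the first layer, reusing the hypothetical LocalEnergyNet from the earlier sketch; grouping the weights per input column is my illustration of the idea, not necessarily the exact regularizer from the paper:

```python
def first_layer_group_penalty(model, lam):
    """Group-l1 penalty over the columns of the first linear layer.

    Column j collects every first-layer weight attached to input
    variable j, so driving a whole column to zero removes that
    variable from the estimated neighborhood, mirroring the
    separation property.
    """
    W = model.net[0].weight                 # shape (hidden, p - 1)
    return lam * W.norm(dim=0).sum()        # l2 norm per column, summed

# hypothetical usage: add the penalty to the screening loss
# objective = neural_ise_loss(model, samples, i) \
#             + first_layer_group_penalty(model, lam)
```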