You should write it in the Q&A box, and it would be better if, before the question, they write which speaker the question is addressed to. So with this, the floor is yours.

Thank you, Artel, I hope you can hear me well. So I'm very happy to participate in this, Mike Scott. Your voice is a little bit low, can you speak closer to the microphone? Can you hear me better? Or is it still low? A little bit, yes. Can I try and work on it? It's kind of at the maximum. Do you still hear me very low? A little bit better, go ahead. I'll try without the headphones, maybe. Sorry? I'll try without the headphones, I'll try that way. Okay, I think it's okay. Can you hear me better now? Yes, a little bit, yes. Okay, sorry about this.

Okay, today I'm very happy to talk about some work done in collaboration with a very bright student who just got his master's from ENS, Hugo Cui, and with Lenka Zdeborová. In this talk I will be talking about active learning. First I will give a brief introduction to the field. Then I will present the theoretical framework we employed to derive theoretical performance bounds for active learning in a simple model; we did this through a large deviation analysis based on a replica calculation. Finally, I will briefly talk about the algorithmic side of the problem, where I will show that AMP can be used to almost saturate the theoretical bounds derived in the analysis.

Active learning is a field of machine learning that deals with problems where you have a lot of data, but obtaining the labels for this data may be very expensive. A very important example is text classification. As you know, the Internet provides basically an infinite source of text, but if you want to know what the topic of one of these pieces of text is, then basically you need a human annotator to say what the topic is, and this is an expensive process. So it is very important to have some strategy for selecting which of these inputs should be labeled, in order to extract the best information possible and to train a model that reaches the best possible generalization.

Specifically, pool-based active learning is the case where you have a fixed set of examples, which is your pool, and from this pool you can choose a certain subset of examples to be labeled. The size of this subset is your budget. So you are given a certain budget and you try to make the best choice in order to reach the best generalization. What this amounts to is a kind of cycle, the pool-based active learning cycle: you have your machine learning model and the unlabeled data pool; you train on whatever labels you already have and test the model on the unlabeled data; where the model is the least confident, you query the labels in order to extract more information; then you retrain your model and keep repeating the cycle. So instead of a computational complexity of order n squared, as for a usual training algorithm, here you have to cycle through the pool an extensive number of times, so the computational cost is at least of order n cubed.
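For concreteness, a minimal sketch of the pool-based cycle just described could look as follows. This is an editorial illustration rather than code from the talk, and the names fit, predict_confidence and oracle_label are hypothetical placeholders for whichever model, confidence measure and labeling oracle you plug in.

```python
# Minimal sketch of pool-based active learning (hypothetical helper names).
import numpy as np

def pool_based_active_learning(X_pool, oracle_label, fit, predict_confidence,
                               budget, batch_size=1, seed=0):
    """Query the least-confident pool points, retrain, and repeat until the budget is used."""
    rng = np.random.default_rng(seed)
    n_pool = X_pool.shape[0]
    labeled = list(rng.choice(n_pool, size=batch_size, replace=False))  # small random seed set
    labels = {i: oracle_label(X_pool[i]) for i in labeled}              # each query costs budget

    while len(labeled) < budget:
        model = fit(X_pool[labeled], np.array([labels[i] for i in labeled]))
        unlabeled = [i for i in range(n_pool) if i not in labels]
        confidence = predict_confidence(model, X_pool[unlabeled])       # one score per pattern
        for j in np.argsort(confidence)[:batch_size]:                   # least confident first
            i = unlabeled[j]
            labels[i] = oracle_label(X_pool[i])
            labeled.append(i)
    # final model trained on all the queried labels
    return fit(X_pool[labeled], np.array([labels[i] for i in labeled]))
```

The cubic cost mentioned above comes from the fact that the model is refit inside the query loop an extensive number of times.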
Of course, to do a theoretical analysis you have to pick a very, very simple model in order to get through the computation, so we go back to our favorite teacher-student perceptron model, which is one of the simplest models where you can define a notion of generalization.

In this model you first extract a teacher vector and then a matrix of patterns, both from IID Gaussian priors. You then use the teacher to obtain the ground-truth labels for these patterns, by taking the scalar product between the teacher and each pattern and taking the sign of the result. This is your training set. What you want to do is find a student perceptron, another vector, which is able to give the same classification on the training set, and then you want to see whether it is also able to give the same predictions on previously unseen examples. What we added to this framework is that the student now also has a budget: instead of seeing the entire pool of examples, it is able to select just a fraction n of it, and this n is of course smaller than alpha, the proportionality constant between the pool size and the input dimension. Note that even though we are selecting the very simple case of IID normal patterns, when you select your subset of patterns with an active learning rule, correlations can appear in the subset, so the problem is not trivial.

The most important ingredient that makes the calculation possible is that, in this very specific model, there is a direct relationship between the mutual information between the teacher and the labels and another quantity that we all know, the Gardner volume. The Gardner volume is nothing but the volume, or the entropy, of the student hypotheses that give the same predictions on the training set. A smaller volume means that you have less uncertainty about the teacher, and so you will generalize better. So our trick was that, instead of studying the large deviations of the generalization error, which is not tractable directly, we could study the large deviations of the Gardner volume.

What we did is to introduce a set of selection variables, which take values zero or one depending on whether or not you choose the corresponding pattern to be put in the training set. The probability measure is then based on the introduction of an energy, controlled by an inverse temperature beta, and the energy is the Gardner volume; you also have a chemical potential phi, which sets the budget for the student. In the high-dimensional limit everything should be self-averaging, and you should be able to just study the free entropy you get in the typical case. This free entropy can be split into three contributions: the contribution from the Gardner volume, the contribution from the budget, and finally a contribution that gives you the complexity, that is, the log-number of possible choices of subsets that give you that same volume. So what we really want to see is how many possible choices get you to a better generalization than random. The catch is also that this entire calculation does not tell you anything about how to choose the labeled subset; we are just asking, in general, what is the best possible performance you can achieve in this setting.

When you do the replica calculation, even though we did it under a replica symmetric assumption, since it is basically a two-level problem the calculation really looks like a 1-RSB calculation. You have two overlaps between students: the first one is between students with the same choice of labeled subset, and the other one is between students with different choices. Then of course you also have a norm for the students and a typical magnetization, which is the important parameter for the generalization: it gives you the overlap between the teacher and the student. The expression you get in the end really looks like the 1-RSB expression for the perceptron, apart from the fact that in the energetic term we have a trace over the selection variables.
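Going back to the setup, here is a minimal sketch of the teacher-student data generation and of the random (typical) choice of a labeled subset. The dimensions and variable names are my own illustrative choices, with alpha and the budget matching the values quoted for the phase diagram below.

```python
# Editorial sketch of the teacher-student perceptron setup with a budgeted subset.
import numpy as np

rng = np.random.default_rng(0)
N = 500              # input dimension
alpha = 3.0          # pool size over N
budget = 0.3         # labeled fraction over N (the student's budget)

P = int(alpha * N)
teacher = rng.standard_normal(N)        # teacher vector from an IID Gaussian prior
X = rng.standard_normal((P, N))         # pool of IID Gaussian patterns
y = np.sign(X @ teacher)                # ground-truth labels: sign of the scalar product

# typical (beta = 0) case: a uniformly random choice of the labeled subset
chosen = rng.choice(P, size=int(budget * N), replace=False)
X_train, y_train = X[chosen], y[chosen]

def generalization_error(student, teacher, n_test=10_000):
    """Disagreement between a student vector and the teacher on fresh Gaussian patterns."""
    X_test = rng.standard_normal((n_test, N))
    return np.mean(np.sign(X_test @ student) != np.sign(X_test @ teacher))
```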
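A compact way to write down the selection measure and the Gardner volume just described, in my own notation rather than necessarily that of the paper, is the following.

```latex
% Selection variables \sigma_i \in \{0,1\}, inverse temperature \beta,
% chemical potential \phi fixing the budget; \theta is the Heaviside step
% function (notation is mine).  Negative \beta biases the measure toward
% subsets with a smaller-than-typical Gardner volume, i.e. toward
% better-than-random generalization.
\begin{equation}
  P(\sigma) \;\propto\; \exp\!\Big(\beta \ln V(\sigma) + \phi \sum_{i=1}^{\alpha N}\sigma_i\Big),
  \qquad
  V(\sigma) \;=\; \int \mathrm{d}\mu(w)\,\prod_{i\,:\,\sigma_i=1}\theta\big(y_i\, w \cdot x_i\big),
\end{equation}
% with V(\sigma) the Gardner volume of students w consistent with the selected
% labels.  In the high-dimensional limit the corresponding free entropy is
% self-averaging and splits into a Gardner-volume term, a budget term, and a
% complexity term counting the log-number of subsets with that same volume.
```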
And what we get is this type of phase diagram. There are a lot of curves in this diagram, but in general we will focus on just one. Here the pool size was alpha equal to three, and we can look for example at the red curve, where the budget is 0.3, so one tenth of the full data set. When beta is equal to zero, so we are looking at the typical case and not studying any large deviation, we get the description of the case where you take a random subset of the patterns: this is just the typical Gardner volume for that subset size, and the complexity of choices, that is, the number of possible subsets that give you this volume, is nothing but the binomial coefficient. However, when you turn on beta, either positive or negative, you can study atypical cases. What you really care about is the negative-beta case, where the volume is smaller than the one you would get from a random choice, so that you generalize better than in the random case. However, from the computational point of view, finding these atypical subsets is a very hard problem.

Also note that on the left of the plot there is a vertical line that shows the volume you would get from learning on the entire pool of patterns. And what you see is that, even though that line corresponds to learning all of alpha equal to three, even with a budget of 0.9 you can get very close to this volume, meaning that the information contained in the entire data set is actually contained in a much smaller fraction of it. We also saw that, of course, when you get to smaller volumes the magnetization increases, meaning that you get more aligned with the teacher, so the generalization error gets lower.

Another way of looking at this large deviation is to look at how fast you approach, or saturate, the maximum amount of information contained in the entire data pool. What we show in this plot, where different colors correspond to different pool sizes, is that the number of patterns you need to pick from the pool is logarithmic in the size of the pool if you want to extract basically the entire information. So the decrease of the volume becomes exponential in the number of selected patterns, which is not what happens in the typical case, shown in this plot by the purple curve.

However, we have to go back to a detail I mentioned, namely that our theoretical bounds apply to any active learning algorithm, including algorithms that are informed about the generative process and are able to exploit this extra information, which is not really the setting we usually consider. In the case where you have no prior information about the generative process, the most information you can get from each pattern is of course one bit, because what you query is just the sign, the label of that pattern. This is represented by the volume-halving curve, which is the dotted black line in this plot.
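A rough, editorial way to state the volume-halving bound just mentioned, under the assumption that each queried label carries at most one bit of information about the teacher, is the following.

```latex
% If each of the n queried labels removes at most one bit of uncertainty,
% the Gardner volume can shrink at most by a factor of two per query:
\begin{equation}
  \ln V_n \;\ge\; \ln V_0 - n \ln 2
  \qquad\Longrightarrow\qquad
  n \;\ge\; \frac{\ln V_0 - \ln V_{\mathrm{target}}}{\ln 2},
\end{equation}
% so the fastest admissible decrease of the volume is exponential in the number
% of selected patterns, which is exactly the regime the large deviation analysis
% shows to be achievable.
```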
Still, even if you stick to this curve, you are extracting the information with a number of patterns that is logarithmic in the size of the pool. So, for this case where we have no information on the structure of the patterns, let us talk about the algorithmic strategies that are usually employed to do active learning. The most common one is called uncertainty sampling, and it is very simple: you keep updating your learning model, and when you have to choose more patterns to be labeled, you test your model on the unlabeled patterns, look where the model is currently least confident, and label those patterns. So you go through a cycle where you train your model, evaluate the model's predictions, sort them according to its confidence, and then query the labels of the most uncertain samples, until you have used up your entire budget.

What we did was to compare some known algorithmic strategies with an algorithmic strategy that we adapted for this specific case, based on message passing. Of course it is not that impressive that we are doing so well, because we are in a model where we know that AMP estimates posterior means and variances almost perfectly. Still, you can see in this plot that, for example, the yellow points representing the performance of AMP stick almost perfectly to the volume-halving curve, represented by the dotted line, while all the other algorithms fail to do as well. A well-known example is query by committee, from a famous paper on this subject; the fact that we do better is simply because query by committee tries to sample the posterior by actually learning several models and trying to make them independent, while we are able to do that very efficiently with message passing. I will skip this because I have no time, and go to the end.

So in the end we have this analysis, and we also have a few remarks and open points coming out of it. The first one is that we did the stability analysis for our replica symmetric ansatz, and we saw that indeed 1-RSB would be needed to describe the most extreme cases: the RS results are only stable around beta equal to zero. However, we saw from the algorithmic results that things behave quite close to what we got from the theoretical calculation, so we do not expect the 1-RSB corrections to be very large. Another important point is that we had to use the trick of studying the large deviations of the Gardner volume instead of the generalization error directly, and this is not doable in many other models, so it would be nice to have a more general approach. Another interesting point is that, as I said, when you take a subset of IID patterns they are no longer IID, and so AMP is no longer guaranteed to converge. This is interesting because, if you look at it from a constraint satisfaction point of view, it is harder for AMP to converge when you have fewer constraints rather than more. And finally, we are looking into connecting this framework with label rewriting strategies, for example soft labeling and distillation, which try to get the best possible generalization out of this process. So thank you for your attention, and I look forward to your questions.
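As a closing editorial illustration of the uncertainty sampling rule described above: in the perceptron setting from the earlier sketch, a standard confidence score for the current student is its absolute margin, so the least confident patterns are the ones closest to its decision boundary. This is a common way to implement the generic heuristic for a linear classifier; the exact scoring rule here is my assumption, not the AMP-based selection used in the talk.

```python
# Margin-based uncertainty scoring for a perceptron-style student (editorial sketch).
import numpy as np

def least_confident(student_w, X_unlabeled, k):
    """Indices of the k unlabeled patterns closest to the student's decision boundary."""
    # |w . x| / ||w|| is the unsigned distance to the hyperplane w . x = 0:
    # a small margin means low confidence for a sign(w . x) classifier.
    margins = np.abs(X_unlabeled @ student_w) / np.linalg.norm(student_w)
    return np.argsort(margins)[:k]
```

Plugged into the generic pool-based loop sketched earlier, this would play the role of the predict_confidence step.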