And now please go ahead. Thank you. Hello. Hello, my name is Tawang, and I'm presenting Project 8 together with Charles, Nishal, and Emily. So these are the group members: me, Charles, Nishal, and Emily — her name is Rong Rong.

This is how our overview looks. We have the motivation, then the approach — the thinking process we've taken — then we show the theoretical part, then the experiments we've conducted, and then the conclusion.

The motivation of this project is that explainable AI is now a hot topic, and the ability to explain clearly the process that has led to a given solution in AI has become more fundamental than before. One of the well-known examples is the EU GDPR, which was adopted around 2016 and came into effect in 2018, and which stipulates that for any automated decision process there should be a clear explanation of the steps that were taken. Some of the examples we have include automated online credit and mortgage scoring, recruiting without any human intervention, and automated insurance quotes. It is fundamental to explain why systems suggest certain decisions, because we all have to respect the principles of ethics and fairness.

But there seems to be a trade-off between rationality and a good explanation. For instance, companies always want to maximize profit, but what maximizes profit doesn't always lead to a good explanation. So essentially we keep asking: how much rationality can a company retain if it wants to maximize its profit, and if it does that, how good can the explanation be? Between those two, you realize there is a tension between the rationality part and the explanation part. So this project is in two parts: there's a part where we focus on the decision, and there's a part where we focus on the rationalization, the explanation.

Now our thinking process, which is the methodology we used. There are two parts. For the symbolic part, we decided we might consider using first-order logic, which is part of deductive reasoning: in deductive reasoning you take a set of rules and then apply them to examples. The other part was to use a statistical approach, where we considered the pageant decomposition framework, synthetic benchmarking, and the resolution and relevance trade-off — these are all papers. Being statistical, they form part of inductive reasoning, which is reasoning from observations. And if we were to consider a combination of deductive and inductive reasoning, it would amount to reasoning with uncertainty and probability. Having taken the inductive route, we considered deep belief networks under the resolution and relevance trade-off, and a subset of them, the restricted Boltzmann machines (RBMs). The pageant decomposition framework can be linked to large deviation theory, which builds directly on the resolution and relevance paper; that part we did not consider further. Then, from large deviation theory and optimal distortion, we did experiments — the coding part. There we used the MNIST dataset and a synthetic dataset, but in this presentation we'll only focus on the results from the MNIST dataset.
So for the theoretical part. For a generic decision problem over outcomes s — say there are S possible choices — the probability q_s that s is the optimal choice is given by equation (1). If we let l_s be the length of the codeword corresponding to s — a word from the corpus that forms the explanation — then for an optimal rationalization we have to minimize the expected code length. From this expression, we realize that it can also be written as an entropy, and the entropy gives us the way to express the rationalization, that is, the length of the codeword. But we realize that taking rational choices this way may lead to choices that are hard to explain. Hence we try a different method: large deviation theory and optimal distortion, which builds on the previous result. Assuming q_s and p_s are probabilities of outcome s, the rationalization H[p] and the distortion are given by the equations in (2). We want to take a decision such that the entropy of p is less than or equal to H_0, our maximum. From (2) we can set up an optimization problem, expressed in the form in (3), where lambda is the Lagrange multiplier. If we take the partial derivative with respect to p_s and equate it to zero, we get from (3) a set of solutions. Note that in (3) there is a plus-minus, and the solution for mu carries the opposite sign: if I select the plus in (3), I have to take the minus for mu. The solutions are case-wise. I'll hand over to my colleague to carry on.

So large deviation theory gives us a framework to think about trade-offs between fidelity and compression when we want to communicate a decision-making process. Given a decision-making process — which can be an algorithm with distribution q over the possible outcomes — we can compress that process into an explanation with distribution p. The way to think about it is that the Kullback-Leibler divergence shown on the previous slide is a measure of fidelity, or accuracy, whereas the entropy of p is a measure of compression. If we plot one against the other, we get a convex curve with lambda, the Lagrange multiplier, as the slope. Here lambda is essentially the dual variable in a constrained optimization problem, and the interpretation is that lambda is a shadow price: the amount of compression we have to give up in order to achieve a certain increase in fidelity. In particular, in the regime where lambda is close to zero, we need to give up very little compression to achieve a marginal increase in fidelity. Next slide, please.

So we test our theory with a deep belief network, which is essentially a stack of restricted Boltzmann machines. What it does is learn representations of the data at decreasing scales of resolution, meaning increasing levels of compression. So when we go from a shallow layer to a deeper layer, the original decision-making process is compressed — coarse-grained — and that leads to a more concise explanation, but at the cost of a distribution that is potentially further away from the original distribution of the data.
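The equations themselves were on the slides, not in the transcript. A minimal reconstruction consistent with what is said above — the expected code length in (1), the entropy/KL pair in (2), the signed Lagrangian in (3), and the opposite-signed mu — might read as follows; treat the exact sign conventions and the normalization multiplier nu as our assumptions:

```latex
% (1) Expected explanation length for codewords \ell_s = -\log p_s;
% the optimum over p recovers the entropy of q:
\min_{p}\ \sum_s q_s\,\ell_s \;=\; \min_{p}\ -\sum_s q_s \log p_s
  \;=\; H[q] \quad\text{at } p = q. \tag{1}

% (2) Rationalization (compression) and distortion (fidelity):
H[p] = -\sum_s p_s \log p_s, \qquad
D_{\mathrm{KL}}(p\,\|\,q) = \sum_s p_s \log\frac{p_s}{q_s}. \tag{2}

% (3) Constrained problem with multipliers \lambda (for H[p] \le H_0)
% and \nu (for normalization):
\mathcal{L} = D_{\mathrm{KL}}(p\,\|\,q)
  \pm \lambda\,\bigl(H[p] - H_0\bigr)
  + \nu\Bigl(\sum_s p_s - 1\Bigr). \tag{3}

% Setting \partial\mathcal{L}/\partial p_s = 0 yields the case-wise solution
p_s = \frac{q_s^{\mu}}{\sum_{s'} q_{s'}^{\mu}}, \qquad
\mu = \frac{1}{1 \mp \lambda},
% i.e. choosing + in (3) forces - in \mu, as noted in the talk.
```

Under this reading, the deeper-layer distribution is a power of the shallower one, p_s proportional to q_s^mu, which is exactly the functional form tested in the experiments below.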
So according to large deviation theory, the relationship between the layers should have a particular functional form: the probability of a state in a deeper layer should be equal to the probability of that state in the shallower layer raised to the power mu, where mu is related to lambda, the Lagrange multiplier. Next slide, please. And so the question is how we can construct these distributions. In a deep belief network, for each layer — for each representation — you have a set of states distributed over all M data instances. We can calculate k_s, which is essentially the empirical frequency of the data points falling in a particular state s in a particular layer, and then the statistic k_s/M is the empirical probability of that state in the given layer. We can then study how these empirical probabilities evolve across the layers, and whether they conform to the predictions of large deviation theory. We used the MNIST dataset to train the deep belief network and check our optimal-distortion prediction, using exactly this formula — a sketch of the check is given after this section. Thank you, Matteo. Sorry — Emily, please go ahead. Apologies.

Yeah, as we just said, q_s is the distribution of the shallower layer and p_s is the distribution of the deeper layer. For example, if q_s is the distribution of layer 6, then p_s is the distribution of layer 7. These two figures are our experimental results. In the left figure, the x-axis is the logarithm of the distribution of layer 6, that is, log q_s = log(k_s/M), where k_s is just the count of state s and M is the size of the training dataset; for the MNIST training dataset, M is 60,000. The y-axis is the logarithm of the distribution of layer 7, that is, log p_s. From the formula of our optimal-distortion solution, we can see that if we take the logarithm of the formula, log p_s should have a linear relation with log q_s. As we can see in the left figure, the results are consistent with our prediction. Besides the left figure, we also plot the right figure, log k_s versus log s. As we just said, by our prediction log p_s has a linear relationship with log q_s, and since these probabilities are just k_s/M, the corresponding log k_s also has a linear relationship. From the right figure, we can see that the layer 4 and layer 5 results are also consistent with our prediction, and we also notice that k_s versus s follows a power-law behavior.

Yeah, so the results. What we've just discussed is the part where our theory successfully predicted the behavior of the network. But there are also places where the theory doesn't quite predict the distribution across two layers: in the first two layers of a fairly deep network, and towards the end of it, we see that the linear relationship doesn't quite hold. In the left figure, we see that a lot of the points are in separate states — they aren't grouped yet, which is what you would imagine happens in the initial layers — and towards the end, the points are all over the place. This actually has a very interesting and intriguing link with some of the other work Matteo has done on the resolution-relevance trade-off, where the idea is basically that different layers make different choices in how richly they want to represent the data.
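As a concrete illustration of the empirical check described above, here is a minimal sketch in Python. Everything named here is an assumption for illustration — the `activations` dictionary of per-layer hidden states (one array of shape M x n_hidden per layer, e.g. from an RBM stack over the 60,000 MNIST training images), the 0.5 binarization threshold, and the per-sample pairing of shallow and deep states — not the group's actual code.

```python
import numpy as np
from collections import Counter

def state_probs(binary_acts):
    """Empirical probability k_s / M of each distinct binary hidden state."""
    M = binary_acts.shape[0]
    counts = Counter(map(tuple, binary_acts.astype(int)))
    return {state: k / M for state, k in counts.items()}

def fit_mu(shallow_acts, deep_acts):
    """Least-squares slope of log p_s (deep layer) versus log q_s (shallow
    layer), pairing the shallow and deep state of each training sample."""
    q = state_probs(shallow_acts)
    p = state_probs(deep_acts)
    log_q = [np.log(q[tuple(row)]) for row in shallow_acts.astype(int)]
    log_p = [np.log(p[tuple(row)]) for row in deep_acts.astype(int)]
    slope, _intercept = np.polyfit(log_q, log_p, 1)
    return slope  # estimate of mu in p_s = q_s^mu / Z

# Hypothetical usage: activations[l] holds the layer-l hidden activations of
# a trained RBM stack over the MNIST training set, binarized at 0.5.
# binary = {l: (a > 0.5) for l, a in activations.items()}
# mu_hat = fit_mu(binary[6], binary[7])
```

On data that follows the predicted form p_s proportional to q_s^mu, the fitted slope would recover mu, and lambda could then be read off via mu = 1/(1 -/+ lambda).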
So the deeper layers are doing a very generic, average-case representation with only a few states, while the shallow layers are not doing a very rich representation but are just doing data preprocessing, and there is a trade-off between the two. In that paper they also show that the middle layers — around layers 5 and 6 — are where there is an optimal trade-off between the resolution and relevance parameters. And in fact that is exactly what we see in our case too: around those same layers we see the power-law relationship holding, but as we look at the shallower or the very deep layers, the relationship doesn't quite hold. We suspect this might be distorted even further if we did some kind of supervised learning, where the use of labels distorts the representation. One next step would obviously be to see what exactly changes over these layers — but let's talk about the conclusions now.

Yes, so we have a few conclusions. We started with the hypothesis that the layers in a deep network are related via this optimal-distortion relationship, which predicts a power-law relationship between the distributions across different layers, and that there is a trade-off between accuracy and compression. Now, lambda = 0 is the regime where we have as much coding length as we want, and that's when we can essentially get a perfect, zero KL divergence. Neural networks don't operate quite in that regime, but if we look back at the graphs and check what the value of lambda is, it's not too big — lambda is fairly small — which hints that the neural network is still trying to focus on accuracy as much as it can. And then we can ask further questions from here. Is there a maximum level of compression? We only looked at the feature-representation space that we presented here, but we might want to put constraints on the representation and the explanation depending on what humans want, and study that relationship more. Also, this was an unsupervised learning framework; can it be extended to supervised learning, where we train with labels, and see whether these predictions still hold? Once we take the labels seriously, do we see a shift in this power-law relationship, or do we still see it? These are the next questions to explore. Yeah, that's all from us.

Thank you very much, guys, for this very nice presentation on a very interesting project. We lost the last speaker a bit — I think there was a delay; that's what I was noticing. So there is time for questions. If anybody has questions, please go ahead; or if Matteo or the students want to add something else about the project, there's still some time. I think they did a great job. I look forward to reading the paper, because I think you are going to keep working on this. So if there are no questions, let me stop there.