My goal is to detect and quantify some modulation of your recorded signal by an experimental stimulus or condition, in the hope that this builds up or helps you understand a little bit about the function. To do that we use tools from statistics. Primarily in our field there is a focus on determining statistical significance within the frequentist or null hypothesis testing framework. But I want to emphasise that I think there's increasing recognition that it's also very important to quantify the size of the effect in a meaningful way. So I think there are many advantages to the information theoretic framework, but if you take just one thing away with you today, I really want to emphasise the fact that it gives you a very nice, consistent effect size across a wide range of statistical comparisons. Whereas many of the conventional tests and statistics that we're familiar with have effect sizes that are a little bit harder to compare. So just to set this up: where and how strongly does the stimulus affect the recorded signal? Here's an example of that question. This is EEG data recorded from a single sensor in an event-related design, with a stimulus that is a modulated image of a face with parametrically varying visibility of the left eye. And this is the ERP plotted for different values of that stimulus. So you can see that there's a very clear effect of the stimulus on the recorded EEG signal, and we want to ask where and how strongly it affects it. Well, we might say by eye that it certainly seems to affect it between here and here, but how strongly is a harder question. So the first thing we might do is a rank correlation between the recorded EEG voltage at each time point and the stimulus value across trials. And we get something that looks like this, which starts to answer the question but I think doesn't completely, because there's a point here where it goes to zero. Does that mean the brain is not really responding? There's no significant modulation there, but it would sort of fall within the range where you would say there was an effect. So I'm going to try and convince you of the mutual information approach. I'll show you how we build up to a result like this and try and convince you that this, I think, is a better answer to the question of where and how strongly the signal is modulated. So what is mutual information? There are many different interpretations that can be applied to mutual information, related to coding, transmission, channel capacities and so on; it's a very elegant theory about communication. But I think the most useful interpretation from a neuroimaging perspective is the simplest: to just view it as the effect size for a statistical test of dependence. So it really is the effect size for a test of dependence against the null hypothesis that two variables are independent. In fact it's equivalent to the log-likelihood ratio test of independence, which, by the Neyman-Pearson lemma, is the most powerful test of dependence. So it's also quite a principled thing to do from a classical statistical point of view.
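As a concrete illustration of that equivalence, here is a minimal sketch for the simple fully discrete case, showing how the plug-in mutual information relates to the log-likelihood ratio (G) test of independence. The contingency table is made up purely for illustration; the point is the scaling G = 2·N·I (with I in nats).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: counts of a discrete stimulus (rows)
# against a discretised response (columns), purely for illustration.
counts = np.array([[30, 10, 5],
                   [12, 25, 8],
                   [6, 11, 28]])
n = counts.sum()

# G statistic: the log-likelihood ratio test of independence.
g, p, dof, expected = chi2_contingency(counts, lambda_="log-likelihood")

# Plug-in mutual information of the same table, in bits.
pxy = counts / n
px = pxy.sum(axis=1, keepdims=True)
py = pxy.sum(axis=0, keepdims=True)
nz = pxy > 0
mi_bits = np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))

# The two are the same quantity up to scale: G = 2 * N * I, with I in nats.
print(g, 2 * n * mi_bits * np.log(2), "p =", p)
```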
But it's quite difficult to estimate in practice. So, we just heard about symbolic methods for time series, but what I'm presenting today is an alternative method which actually gives an approximation of mutual information. It's semi-parametric and provides a rigorous lower bound, which is useful for statistical testing. And of course I want to emphasise there's code online, and there's a preprint which goes into much more detail that's currently under review. So what are the particular key properties or advantages of this estimator? Particularly its multivariate performance, which I'm going to talk a little bit about. It's rank-based, so it's inherently robust and performs well on noisy neuroimaging data. Crucially, we can combine continuous and discrete variables within the same framework, so it sort of unifies what would have been t-tests, chi-squared tests and correlations, all with a common effect size. And in the preprint we show it has equivalent statistical power to existing methods, where they're available, in these univariate cases. We do statistical inference with non-parametric permutation testing, and I want to emphasise that it's very easy to use. We're not talking about a toolbox with hundreds of parameters like some of the techniques you've been seeing today. It's just a plug-in function that replaces correlation and calculates a bivariate function of two variables that gives you this effect size. So I wanted to say a little bit more about what I mean by multivariate here. I'm talking about what you might call intermediate multivariate, so two to ten dimensions. I think we're very well served, obviously, by classical univariate statistics. And I think recently we've inherited a lot of techniques from supervised machine learning for very high-dimensional spaces, like more than a hundred dimensions. This method can't do hundreds of dimensions, but it can do something in the range of two to twenty, depending on your data. And I wanted to suggest that that fills in a niche in the middle, where you can still exploit the local properties of your signal in space and time, but at the same time get a bit of a boost in power over having to do univariate reductions. I wanted to again emphasise it's really very simple: you just concatenate your multiple variables to do a multivariate calculation. And I also think it could be combined with supervised machine learning to give a measure of performance on this same common scale, which allows you to compare many different situations and gives some of the other advantages that I'm going to mention if I have time. So, some examples of these sort of low-dimensional multivariate responses that I think are quite useful. Complex spectral data is obviously two-dimensional, and it's often interesting to split that into phase and power. In a multivariate setting these circular variables become less of a problem, because we can just keep the 2D representation. We have MEG magnetic field vectors, which we might want to look at without reducing to a single maximum-variance direction, and also look at their amplitude and orientation. I'm going to show today an example of adding in a single-trial temporal derivative. With fMRI we might want to have multiple measures of activation within a single voxel, for example a beta corresponding to the HRF as well as its derivatives.
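To make the "plug-in replacement for correlation" idea concrete, here is a minimal sketch of a Gaussian-copula mutual information estimate for continuous variables, with a simple permutation test. The function names are hypothetical, and the bias correction and the discrete-variable variants are left out; the code released online is the reference implementation. This is just the core idea: rank-transform each margin onto a standard normal, then compute the Gaussian mutual information from the covariance.

```python
import numpy as np
from scipy.special import ndtri      # inverse CDF of the standard normal
from scipy.stats import rankdata

def copnorm(x):
    """Copula-normal transform: rank each column, map the ranks onto a
    standard normal. x: (n_trials,) or (n_trials, n_dims)."""
    x = np.atleast_2d(x.T).T          # ensure 2D with trials on the first axis
    ranks = np.apply_along_axis(rankdata, 0, x)
    return ndtri(ranks / (x.shape[0] + 1.0))

def gauss_entropy(c):
    """Differential entropy, in bits, of a Gaussian with covariance matrix c."""
    d = c.shape[0]
    return 0.5 * (np.linalg.slogdet(c)[1] + d * np.log(2 * np.pi * np.e)) / np.log(2)

def gcmi(x, y):
    """Gaussian-copula lower bound on I(X;Y), in bits. Multivariate variables
    are just extra columns of x or y."""
    cx, cy = copnorm(x), copnorm(y)
    cov = np.cov(np.hstack([cx, cy]), rowvar=False)
    dx = cx.shape[1]
    return (gauss_entropy(cov[:dx, :dx]) + gauss_entropy(cov[dx:, dx:])
            - gauss_entropy(cov))

def permutation_p(x, y, n_perm=1000, seed=0):
    """Non-parametric inference: shuffle trials of y to build a null distribution."""
    rng = np.random.default_rng(seed)
    obs = gcmi(x, y)
    null = np.array([gcmi(x, rng.permutation(y)) for _ in range(n_perm)])
    return obs, (np.sum(null >= obs) + 1) / (n_perm + 1)
```

Multivariate responses then need no special handling: you simply pass more columns, for example `np.column_stack([voltage, gradient])`, or phase and power kept together as a 2D pair.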
And we might want to consider multiple responses at the same time, for example to get at representational interactions, which I'm also hopefully going to tell you a little bit about. So this mutual information is a bivariate measure of dependence, but I'm going to try and convince you that it forms the basis of a framework for data analysis, and that there are other information theoretic quantities that consider more than two variables and let you answer additional questions. Obviously this is a bit of an overview, but I just wanted to make the point that we have many different statistical tests whose effect sizes are hard to compare, because you have, first of all, inherently different statistics, and also very different degrees of freedom in different particular experiments. So all of these things are maybe hard to compare, but they all have a direct information theoretic analogue. And the crucial thing is that the information theoretic values really give you common, quantitatively comparable effect sizes that you can meaningfully compare, add and subtract, and so on. Conditional mutual information is like partial correlation: it lets you condition out the effects of other things. We have directed information, which is like the transfer entropy we just heard about, and we have a measure of communication which I'm hopefully going to mention. And we have other measures like interaction information, which is maybe conceptually a little bit similar to RSA in that it lets you quantify the similarity or representational interactions between signals. Most of the examples I'm going to show today are MEG and EEG, but I just wanted to show a very recent result that's not in the preprint, which is that we can also do this reasonably well on fMRI. So this is just a single subject, visual oddball versus standard. This is a standard SPM F-test output with a family-wise error of 5%, and this is the single-voxel mutual information calculated from single-trial betas, thresholded with a permutation test, so you can see we don't lose much. And obviously there's not really much point in doing it if you're just going to look at this image, but with the information framework we have the advantage that we can now look at the redundancy between these regions, between individual voxels, and so on. In fact, here are some of the advantages: we can do computationally efficient non-parametric single-subject statistics. We can consider these multivariate betas, which I think could be more powerful than the F-test, which tests the hypothesis that at least one of them is significant, whereas here we can really find multivariate effects. We can directly evaluate and condition out effects of other trial-by-trial confounds, and we can look at these representational interactions, which in information theory we call redundancy or synergy, either between regions, between individual voxels, or with alternative modalities such as simultaneously recorded EEG. So I'm going to tell you a little bit about representational interactions. In many cases you might have two statistically significant modulations in two different responses, whether they're different cortical regions, different time windows, different frequency bands, or, as I just mentioned, different signals like EEG and fMRI. In all of these cases you might ask to what degree the modulations are similar.
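As one illustration of how the classical analogues line up, the partial-correlation analogue, conditional mutual information, can be computed with the same Gaussian-copula machinery, for example to condition out a trial-by-trial confound. This sketch reuses the hypothetical copnorm and gauss_entropy helpers from the previous example and again omits bias correction.

```python
def gccmi(x, y, z):
    """Gaussian-copula lower bound on the conditional mutual information
    I(X;Y|Z), in bits (the analogue of partial correlation): the dependence
    between x and y with the trial-by-trial variable z conditioned out."""
    cx, cy, cz = copnorm(x), copnorm(y), copnorm(z)
    dx, dy = cx.shape[1], cy.shape[1]
    cov = np.cov(np.hstack([cx, cy, cz]), rowvar=False)
    d = cov.shape[0]
    xz = np.r_[np.arange(dx), np.arange(dx + dy, d)]   # columns of X and Z
    yz = np.arange(dx, d)                              # columns of Y and Z
    zz = np.arange(dx + dy, d)                         # columns of Z
    # I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)
    return (gauss_entropy(cov[np.ix_(xz, xz)]) + gauss_entropy(cov[np.ix_(yz, yz)])
            - gauss_entropy(cov[np.ix_(zz, zz)]) - gauss_entropy(cov))
```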
If you observe one, does observing the other one add any more information, or is it sort of overlapping, so that in fact you get everything from the first one? So, returning to the original example where we saw the rank correlation: if I now calculate the mutual information at each time point, just in the EEG voltage, we get something that looks like the absolute value of the rank correlation, because mutual information is unsigned. So here we have exactly this case with two peaks, an early peak and a later peak. We might want to ask to what degree the information here is the same as the information here, or would you make a better prediction if you looked at both of those time points? We can answer this with a quantity called interaction information, which, skipping the detail, conceptually measures this overlap in the information between the two responses. And because of a nice property of mutual information, this is just a linear combination of information values. There are three outcomes for this quantity. We can have redundancy, which means that there's an overlapping representation, and it suggests that the modulations reflect the same processing mechanisms. If it's zero it means that they are somehow independent. And we can also get synergy, which means that the actual trial-by-trial relationship between the two responses is itself modulated by the stimulus. And while redundancy can also be addressed with RSA or cross-decoding methods, as far as I know synergy is something that can't really be addressed with those methods. So here's an example of this. This is the single subject, single EEG sensor that I showed before, and we compute the interaction information between every time point and every other time point. Here negative values are redundancy, and you can see that this peak here is mostly redundant with itself, and also these two peaks are sort of redundant with each other. But I was very surprised when I saw this to see this very strong synergy, and I wanted to point out that this synergy actually corresponds to a region of the time course where there's no information in the voltage. So I really didn't expect to see this, and I tried to think about what it means. It means that the actual EEG voltage value doesn't tell you any information about the stimulus, but when you know two neighbouring values, wherever they sit in absolute terms, that tells you something. I think the simplest relationship between two points is the difference between them, and that's the single-trial gradient. So I tried that out. I thought it would be too noisy, but in fact, if you do the statistics on the single-trial temporal derivative, you get a very strong modulation, which for me was revealed by the synergy. So if you did have to do one-dimensional statistics, in fact, in this experiment, you would be better off doing it on the single-trial gradient, which is something I haven't seen people do a lot in EEG and I thought was interesting. But within this mutual information framework, we don't have to do one-dimensional statistics; it's all multivariate. So I can just throw both responses into a two-dimensional response: at each time point, we consider the voltage and the instantaneous gradient of the voltage.
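A sketch of how these two pieces fit together, again using the hypothetical gcmi function from earlier: the interaction information is just a difference of mutual information values (with the sign convention used in the talk, negative meaning redundancy and positive meaning synergy), and the two-dimensional voltage-plus-gradient response is simply two columns passed to the same estimator.

```python
def interaction_information(x1, x2, s):
    """Interaction information about the stimulus s between responses x1 and x2.
    Sign convention as in the talk: negative = redundancy, positive = synergy."""
    joint = np.column_stack([x1, x2])        # joint (multivariate) response
    return gcmi(joint, s) - gcmi(x1, s) - gcmi(x2, s)

def info_voltage_gradient(eeg, stim):
    """MI time course for the 2D response (voltage, temporal derivative).
    eeg: (n_trials, n_times) single-sensor voltages; stim: (n_trials,) values."""
    grad = np.gradient(eeg, axis=1)          # single-trial temporal derivative
    return np.array([gcmi(np.column_stack([eeg[:, t], grad[:, t]]), stim)
                     for t in range(eeg.shape[1])])
```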
And now we get this curve I showed you at the beginning, which I think answers this question of where and how strongly in a much clearer way than the conventional rank correlation. Because this really shows you: OK, it starts here; we have this profile where there's a little ledge of weaker effect, then it becomes strong and then slowly tails off. And I think it would be hard to pick up that temporal profile from looking at this. And again, we can even do multivariate interactions between the two-dimensional responses, and this shows that in fact this ledge is somehow an independent representation from the later part. So this is a work in progress, but hopefully it shows some of the application. Even with single-channel, sensor-level EEG, these methods can add something over a standard statistical analysis. And hopefully I've convinced you that there's something useful in the synergy: although these concepts can be quite complicated, here it directly showed something useful for me, which was to look at the gradient. And all of this can be applied directly to EEG and MEG, and it's relatively quick and easy compared to many other computationally intensive techniques. Connectivity and communication: I don't think I'm going to have very much time, so I'll just skip over the detail. The point is that we use this idea of redundancy from the interaction information and combine it with the Wiener-Granger causal framework. So, to make the same step from just looking at differences in activation to looking at the information, the modulation by a specific stimulus, we do that in this Wiener-Granger causal framework, to say that it's now no longer just a relationship between the activity in two areas; it's really about the information content in the two areas. So the basic argument is this: if one region, for example a left electrode, shows information about a stimulus at an earlier time T1, and then another region, for example an electrode on the opposite hemisphere, shows information at a later time, and crucially if this information is really the same, so the representations here, in this one-dimensional (or two-dimensional) space, are the same, then this suggests that it has been communicated from A to B. And we have a quantity called directed feature information, which is like an extension of transfer entropy that incorporates this idea, so that instead of thinking about just the activity, we're thinking about the second-level thing, actually the representation or the information content in the signals. It measures the redundancy about F between the past of one signal and the present of the other signal, conditioning out the past. I just don't have time to go into that in too much detail. But the main thing I want to get across is that we have a practical statistical framework for neuroimaging data analysis based on information theory. It's a simple statistical function, a plug-in replacement for correlation, but crucially it can handle multiple different statistical comparisons, including the sort of intermediate multivariate case and continuous and discrete variables. And in all of these it gives you effect sizes on a meaningful, in fact additive, common scale. So I think having a consistent effect size is particularly important for large-scale automated results repositories and for meta-analysis, which is what we've heard about this morning.
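For what it's worth, one way to sketch the directed feature information idea in the same hypothetical framework, following the verbal description above rather than the precise definition in the preprint: take the directed (transfer-entropy-style) information from X to Y as a time-lagged conditional mutual information, and take the DFI as the part of that directed information that is redundant with the stimulus feature F, i.e. what disappears when F is also conditioned out.

```python
def directed_info(x, y, t, lag):
    """Transfer-entropy-style directed information I(X_past ; Y_t | Y_past),
    here with a single-sample past for simplicity (a real analysis might use
    a longer past window). x, y: (n_trials, n_times) signals."""
    return gccmi(x[:, t - lag], y[:, t], y[:, t - lag])

def directed_feature_info(x, y, f, t, lag):
    """Sketch of DFI about feature f: the part of the directed information
    from X to Y that vanishes when f is also conditioned out, i.e. the
    feature-related redundancy between X's past and Y's present."""
    past_and_f = np.column_stack([y[:, t - lag], f])
    return directed_info(x, y, t, lag) - gccmi(x[:, t - lag], y[:, t], past_and_f)
```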
So my hope is that this framework, or something like it, where you have many different tests with a common effect size, could actually really help with this problem if it was widely adopted, because meta-analyses become much easier. I already find it quite useful in my work that I really have an intuition now for this effect size scale, and when I look at a new data set, I can easily tell if it's out by an order of magnitude, because I'm just used to working with the values. So I think this thing of really being able to compare behaviour, being able to compare all the different responses in different experiments, and also to compare the output of models or simulations, all on this common scale, is valuable. I showed EEG and a little bit of fMRI. We also use it with MEG, and it can also be applied to LFPs, all combined, as I hope to try, with large-scale multivariate decoding. I'd like to thank my collaborators, primarily in Glasgow, but also Stefano Panzeri, who developed the DFI measure. And just to reiterate, there's code online and a preprint that explains it all, probably much better than I have today. It's a very simple function, so very easy to try compared to probably many other things. Thank you.

Thank you very much, Robin. Excellent presentation.

Yeah, I was just wondering about the effect size. I love this idea; I think the effect size absolutely should be important, and then there's the power aspect. I was wondering a bit about the power aspect. And especially, how does the effect size measure depend on the amount of data? Is there a dependency?

Yeah, there's sort of a weak dependency. In the information theoretic world we call it bias, and you have to be careful about this bias. I didn't go into too much detail, but this new estimator has very, very low bias, at least sampling bias. So effectively, as long as you have a reasonable amount of data, it's not really a problem. I think that's really the advantage of this estimator: it lets you then compare across different sample sizes, for example. And then comparing to d-prime, I'm not so sure; we'd want to do some sort of more grounded power analysis and work on that. But the values do have a real meaning, you know, one bit means a reduction of uncertainty by a factor of two. So there is some intuition that you can build up.

The other question, partially, is the amount of data that you need, especially in the multivariate case.

Yeah, I would say it depends how strong your effect is.

Can we think of it in terms of some kind of power analysis, for instance?

Yeah, I mean, I have some other ideas about group analysis. So I would always do this sort of thing at the single-subject level and then combine the information... Yeah. I would say that you can... I mean, it's not magic. It's not really any better than any of the individual tests like a t-test or a Hotelling's T-squared test. So if you have enough data to get a result with those, you will get a result with this. The point is that the advantage is then, you know, that you can compare it more easily across multiple cases.

A quick question from myself. Robin, is there any qualitative difference between the application of this method to fMRI data versus EEG or MEG? I'm thinking particularly of the HRF.

Yeah, I mean, I do it on the GLM. I do single-trial betas from the GLM, in the same way many people do for multivariate analyses.

For multivariate analyses?
Well, no, I mean, you first have to do this sort of non-information-theoretic analysis step where you extract the activations with the general linear model. But then once you're working on your activations, it's the same, except you don't have a time course; you just have one beta per trial. Because it relies on this sort of semi-parametric assumption, it only quantifies Gaussian copula dependence. So you can have any marginal distributions, but it will only quantify the dependence based on the assumption of a Gaussian copula. That means it gives you a lower bound on the real information, but it also means it scales much better to multiple dimensions, because you have the advantage of the Gaussian properties. So it gives you a lower bound, not the true value. Of course there are advantages to doing the much more computationally intensive techniques, but I think what I've found is that in many realistic experiments it's sensitive enough. The advantages it gives are worth the trade-off that you're not actually measuring the true nonlinear mutual information.

So when you calculate this copula-based mutual information, how does it relate to the other typical use of Gaussian-window-based estimation when you have continuous variables?

Yeah, I think it's just a different method. I mean, I think by Gaussian window you mean a kernel density method? Then I guess you would have to choose some sort of parameters there.

When you actually have general continuous variables, I think this question is very important, because you emphasised that you could actually use this Gaussian copula to calculate the mutual information for both discrete and continuous variables at the same time. The ranking actually matters, so the exact parameters matter.

So I guess I don't fully understand the question, but it's a lower bound, so it doesn't give you exactly the same as the KDE methods. I think within neuroscience the most common alternative methods are these nearest-neighbour methods, which have their own problems, with low bias but very high variance. So I think it's a trade-off compared to those methods. It doesn't measure every non-linear effect; you measure only the Gaussian copula dependence, but it's so much better behaved and still gives you a consistent effect size within that.

So this is where you rank things, so you're really ranking the lower bounds?

Sorry?

When you compare different mutual information values calculated in this way, you're actually ranking the lower bounds.

I'm sorry, I don't follow. Ranking the lower bounds, the lower bounds of what?

Because you mentioned that the copula can actually only measure the lower bounds.

Yes, yes.

If you can actually only rank the lower bounds, isn't that a bit strange?

I'm not sure what you mean by rank, but maybe we can discuss it afterwards.

Can you comment on the relationship to the distance correlation, which is also an interesting measure of dependence that has some good properties? Does it scale? Is it better behaved in this lower-dimensional regime where you actually have Gaussian copulas? Would it be more powerful than using distance correlation?

Well, I think the distance correlation is certainly better in high dimensions, because here, ultimately, you rescale, you normalise your data, and then you estimate a covariance matrix. Obviously there's a limit, with your data, on how big a covariance matrix you can estimate. So I think in high-dimensional cases, distance correlation is much preferable.
And also the distance correlation really captures the nonlinear stuff, without this assumption of only capturing Gaussian copula dependence. I'd like to try and think some more about the real relationship between them. One thing we have looked at is doing pairwise mutual information between many different stimuli and trying to think about that as similar to the distance correlation and so on, but we don't really have any concrete thoughts. But I think, compared with the distance correlation, the key advantages are the ones that I tried to get across here. One of them is being able to do this interaction information, and, although I didn't talk about it today, there are extensions of this with the partial information decomposition that let you really get at the different information content in different low-dimensional variables. I think that's useful, and I don't think that's something you can do with the distance correlation, but I mean, it's about using the right tool for the problem, I think.

Okay, I think we have a stimulating discussion going here, but we still have one more event to do, so I'd like to thank Robin for a fantastic talk. Thank you.