Hi, my name's Emma, and today I'm going to talk to you about a new framework we've developed for modelling selective attention during cocktail party listening. In our everyday lives we're often faced with the challenge of understanding speech when other sounds are present. A feature of attention is that different stimuli can be selected without any change in sensory signals. In other words, you can switch your attention between different people's voices given the same acoustic signal. And we know that in these situations people benefit from knowing where to listen in advance. So imagine a typical experimental situation here, where the listener is in the centre of an array of loudspeakers: they hear three different phrases spoken by three different talkers at different locations in front of them. We can tell them in advance which talker they should selectively attend to by presenting a visual cue on the screen in front of them. In many situations attention isn't all or none but rather fluctuates over time, even when the attribute that's attended isn't obviously time itself. For example, reaction times progressively improve as an instructional cue is presented longer in advance of the target talker. Also, EEG activity gradually increases in amplitude before the target talker starts speaking, as shown by the red line on this plot. These effects, which are on the order of a second or longer, appear to be distinct from faster changes in attention that have previously been associated with theta oscillations, which may be related to the sampling of stimuli. Here we wanted to know which computational processes underlie the slow temporal evolution of attentional set. So today I'll introduce a new generative model we developed for attention in the context of a classic cocktail party paradigm. We use this generative model to test competing hypotheses about the way that human listeners direct preparatory and selective attention.
Here we focus on the top-down attention aspect of cocktail party listening rather than on other processes involved in the task, such as extracting words from a continuous acoustic signal, which is something we've examined in previous work. Our goal is to explain behavioural and EEG results that show a slow temporal evolution of attention under a unified framework. So we used active inference to simulate synthetic agents that act under different generative models. Active inference has been applied to a variety of domains in neuroscience. It's an extension of predictive coding, based on the idea that our brains actively predict the speech signal using an underlying generative model. Essentially, this means we treat cocktail party listening as a Bayesian inference problem: inferring the correct response to make given our sensory observations. Computationally we use approximate Bayesian inference, and this relies on evaluating a quantity called free energy, which is related to model evidence. As you can see from this equation, free energy can be written as a trade-off between accuracy and complexity. Essentially this means that people update their beliefs to provide the simplest and most accurate explanation of sensory data. By minimizing free energy, they simultaneously maximize the evidence for their generative model of the world. In this work we model a simple version of the cocktail party paradigm, in which a visual cue directs attention to a talker on the left or right, and the task is to report the color and number words at the cued location. In this example the visual cue points left, so the correct answer would be "white one". In our generative model the relevant outcomes in the world are the visual cue, the color and number words spoken by the talkers on the left and right, and feedback about whether the response was correct or incorrect.
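To make the accuracy-complexity trade-off concrete, here is a minimal sketch for a discrete hidden-state model. The two-state example, the numbers, and the function name are illustrative choices of mine, not taken from the actual model:

```python
import numpy as np

def free_energy(q, prior, likelihood_o):
    """Variational free energy of a belief q over discrete hidden states,
    given a prior and the likelihood p(o | s) of the observed outcome.
    F = complexity - accuracy, where complexity is KL[q || prior] and
    accuracy is E_q[ln p(o | s)]."""
    eps = 1e-16  # avoid log(0)
    complexity = np.sum(q * (np.log(q + eps) - np.log(prior + eps)))
    accuracy = np.sum(q * np.log(likelihood_o + eps))
    return complexity - accuracy

# Two hidden states (attend-left vs attend-right) with a flat prior.
prior = np.array([0.5, 0.5])
lik = np.array([0.9, 0.1])  # the observed cue is far more likely under "left"

# A belief that matches the evidence has lower free energy than a flat one;
# at the exact posterior, F equals the negative log model evidence.
print(free_energy(np.array([0.9, 0.1]), prior, lik))  # ~0.693 = -ln(0.5)
print(free_energy(np.array([0.5, 0.5]), prior, lik))  # ~1.204
```

Updating beliefs to reduce F therefore balances fitting the observation (accuracy) against moving too far from the prior (complexity), which is the sense in which minimizing free energy maximizes model evidence.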
The things we assume people are trying to infer are the direction they should be attending, how strongly they should be attending (which will become important later), the target color and number words, that is, the words on the cued side, and also the response they should make. In this generative model, beliefs about how strongly to attend affect the precision with which beliefs map to outcomes. We can think of precision like a dial that varies continuously from high to low. When precision is high, our beliefs are strongly related to outcomes in the world, whereas when it's low, our beliefs are only weakly related to outcomes. And you can imagine that people might want to conserve energy by only having high precision when they expect a target stimulus to occur; this relates to long-standing ideas of attention as a resource that needs to be allocated. So we tested five different models of how precision varies over time, which correspond to five different hypotheses about how the slow attentional set could be instantiated. The first hypothesis was that precision increases linearly over time, starting from the time that participants see the visual cue. The second was that precision increases exponentially, staying low until the expected time that the target talker would start speaking. The third was that precision follows an exponential cumulative distribution function, essentially increasing rapidly soon after the visual cue and then staying stable. The final two were null hypotheses, stating that precision remains the same over time. We specified two different null hypotheses, one with low precision and one with higher precision, so that the null hypothesis wasn't biased towards a particular precision value. On each trial the synthetic agent could respond with one of 16 color-number combinations. We also specified different times at which the agent could respond.
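As an illustration, the five precision trajectories could be parameterised as follows. This is a sketch: the bounds `lo` and `hi` and the rate constant are placeholder values I've chosen, not the fitted parameters:

```python
import numpy as np

def precision_profiles(t, lo=0.5, hi=2.0, rate=3.0):
    """Five candidate precision trajectories over a normalised cue-target
    interval t in [0, 1]. lo, hi, and rate are illustrative placeholders."""
    return {
        # rises steadily from cue onset
        "linear": lo + (hi - lo) * t,
        # stays low, rising sharply near the expected target onset
        "exponential": lo + (hi - lo) * (np.expm1(rate * t) / np.expm1(rate)),
        # exponential CDF: rises rapidly after the cue, then plateaus
        "exp_cdf": lo + (hi - lo) * (1.0 - np.exp(-rate * t)),
        # null hypotheses: constant precision throughout the interval
        "null_low": np.full_like(t, lo),
        "null_high": np.full_like(t, hi),
    }

t = np.linspace(0.0, 1.0, 101)
profiles = precision_profiles(t)
```

Midway through the interval the three time-dependent profiles already separate clearly (the exponential model is still near its starting value while the CDF model has nearly plateaued), which is what allows the data to adjudicate between them.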
In other words, the agent could respond early or wait to gather evidence by listening to the color and number words at the two locations. Note that the agent can only work out the correct response by observing the visual cue, using it to attend to the correct location, and then responding with the color and number words spoken at that location. We used this generative model to model both reaction times and EEG responses. First we'll consider reaction times. Here we're seeking to explain the finding that reaction times depend on the length of time that the visual cue is on the screen before the talkers start speaking, which is displayed on the x-axis of this graph. The empirical data are shown by the green line on the plot. When the cue is presented longer in advance, we found that participants are faster to respond. To simulate different cue-target intervals, we presented the visual cue at the start of each trial and varied the time at which the talkers spoke the key words. This means that under all models apart from the null models, the agent had a different precision value at the time the words were spoken. Our simulated reaction time was the time integral of the gradient descent on free energy. This means that if the model takes longer to make the inference about which response is correct, then reaction times will be slower. We found that all of the models fitted the empirical data very well, even those that didn't incorporate temporal changes in precision: all were within one standard deviation of the data. That's shown more clearly by the plot on the right, which shows the root mean squared error between each of our models and the data. In fact, one of the null models has the best fit, very slightly. This means that an increase in precision is not needed to account for faster reaction times with longer cue-target intervals. We can also simulate neural responses using this framework, by treating neuronal dynamics as a gradient descent on free energy.
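One way to sketch how precision could speed inference is to treat precision as a gain on belief updating. This is a simplification of the scheme described here (where reaction time is the time integral of the gradient descent); the step-count proxy, the softmax parameterisation, and all numeric values below are my own illustrative choices:

```python
import numpy as np

def steps_to_converge(prior, likelihood_o, gamma,
                      lr=0.05, tol=1e-3, max_iter=10000):
    """Reaction-time proxy: the number of gradient-descent steps on free
    energy before beliefs settle on the posterior. Precision gamma acts as
    a gain on each update, so higher precision -> faster convergence.
    (Illustrative: the actual model integrates the descent over time.)"""
    v = np.log(prior)                              # log-space belief state
    target = np.log(prior) + np.log(likelihood_o)  # unnormalised log posterior
    for step in range(1, max_iter + 1):
        err = target - v                           # prediction error
        err -= err.mean()                          # remove softmax-invariant shift
        if np.abs(err).max() < tol:
            return step                            # free-energy minimum reached
        v += lr * gamma * err                      # precision-weighted update
    return max_iter

prior = np.array([0.5, 0.5])
lik = np.array([0.9, 0.1])

# Higher precision at word onset -> fewer steps -> faster simulated RT.
print(steps_to_converge(prior, lik, gamma=2.0))
print(steps_to_converge(prior, lik, gamma=0.5))
```

Under this toy scheme, a model whose precision has ramped up by word onset infers the correct response in fewer update steps, which is the intuition behind expecting faster reaction times at longer cue-target intervals.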
So we can think of this as starting off at one point on this function and descending until we reach the free energy minimum. Thinking of this in terms of prediction errors, the prediction error is equal to the negative free energy gradient. We continue updating our beliefs until the prediction error is zero, which means we've reached the free energy minimum. So the idea is that neural activity starts off encoding prior expectations and ends up encoding posterior expectations. This means we can make predictions about EEG responses for each of our models as well. These plots show simulated EEG responses during the preparatory interval under the five different models. We found a clear distinction in the predicted EEG responses between the models with different precision profiles. The two null models, shown on the right, which had no temporal changes in precision, showed flat responses, whereas the models shown on the left, which incorporated time-dependent increases in precision, were associated with increased responses during the preparatory interval. We entered these simulated EEG responses into a hierarchical regression with the empirical EEG results as the dependent variable. Of the models we tested, the exponential model best accounted for the ramping of EEG activity during the preparatory interval. So these results demonstrate that, unlike for the reaction time data, a temporal increase in precision is needed to account for the ramping of EEG activity during the preparatory interval. In summary, I've presented a new framework that treats cocktail party listening as a Bayesian inference problem based on active inference. We found that time-dependent changes in precision weren't needed to explain faster reaction times with longer preparatory intervals, but they were necessary to explain the ramping of EEG activity during the preparatory interval, and therefore these two sets of findings might be underpinned by distinct processes.
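The neuronal dynamics described earlier, where activity moves from encoding prior expectations to encoding posterior expectations as prediction error decays to zero, can be sketched as follows. This is a toy two-state example using a softmax parameterisation of beliefs; all values are placeholders, not fitted parameters:

```python
import numpy as np

def belief_trajectory(prior, likelihood_o, lr=0.1, n_steps=100):
    """Neuronal dynamics as gradient descent on free energy: beliefs start
    at the prior and settle at the posterior, driven by prediction error
    (the negative free-energy gradient). Illustrative sketch only."""
    v = np.log(prior)                        # log-space belief state
    target = np.log(prior * likelihood_o)    # unnormalised log posterior
    beliefs, errors = [], []
    for _ in range(n_steps):
        err = target - v                     # prediction error
        err -= err.mean()                    # remove softmax-invariant shift
        v += lr * err                        # descend the free-energy gradient
        q = np.exp(v - v.max())
        q /= q.sum()                         # current belief (softmax of v)
        beliefs.append(q.copy())
        errors.append(np.abs(err).max())
    return np.array(beliefs), np.array(errors)

prior = np.array([0.5, 0.5])
lik = np.array([0.9, 0.1])  # posterior here is [0.9, 0.1]
beliefs, errors = belief_trajectory(prior, lik)
```

The error trace decays towards zero as the beliefs settle on the posterior, mirroring the idea that activity begins by encoding prior expectations and ends by encoding posterior expectations; it is trajectories of this kind that generate the model-specific simulated EEG responses.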
In our generative model, the time-dependent changes in precision can be considered more generally as a possible mechanism for temporal attention, reflecting different expectations about when a target stimulus will occur. In our model this form of temporal attention is factorized from the spatial attention factor, which determines which spatial location is the focus of attention. In other words, the model contains separate beliefs about where and when to attend. We hope this model will be useful in future work: it could be used to simulate a variety of other empirical effects, and it generates quantitative predictions for behavioural and neural responses which can be formally tested. So I'd like to thank R&AD for funding the work, and thank you for listening.