Hi everyone, thanks for this opportunity to present some of the work in my lab, and special thanks to the organizers, particularly Charlotte and Tobias, for this invitation. When they asked me to indicate the topic that I would discuss at this meeting, I felt that auditory salience is an important aspect of our experience when we perceive sounds in everyday life, in that it explores which sounds in our sensory input command our attention in an involuntary fashion. By understanding what makes sounds salient, we learn how we can encode the complex information in everyday soundscapes by focusing on what is important, commanding, conspicuous, and probably ecologically or perceptually relevant, since it is neither realistic nor practical to encode all the sensory information in our surroundings.

If I start with a visual analogy and present you with this image, you can infer a lot of information from this visual scene. This is obviously a movie theater or performance hall, and the movie or performance is probably under way since the lights are off; yet what grabbed your attention is probably this one audience member who is taking a phone call or browsing her phone. What makes this aspect of the scene salient is a combination of sensory cues, including changes in luminance and contrast, along with contextual cues about what the setting is.

Now if I move to an auditory example, things are obviously a bit different. You probably cannot discern much if I show you the time waveform. Even if I show you the corresponding time-frequency spectrogram, it is really hard to discern whether there are any commanding events that stand out in this specific auditory scene. Now, if I play the sound, I will engage your auditory system to process this information, and the auditory system will use its sensory and attentional processes to interpret what the scene is and whether there is something salient or distracting in it. So hopefully everybody heard that, early on in this quintet performance, somebody's phone started ringing. The phone ring is not necessarily louder than the rest of the signal. If I show you where the phone rang: had we relied only on dynamic range to decide how to encode this information, this ring would have gone completely unnoticed. Yet it is an important event, and the same would apply to other ecologically important sounds: an alarm, a siren, et cetera.

So our goal is to understand what makes a sound command our attention, or become salient, in a dynamic scene, and what processes facilitate this representation in the auditory system. If this were a study of visual salience, we could have used eye movements, as gaze is a good indicator of involuntary attention. For an auditory stimulus, there is really no good equivalent behavioral response. So for this study we used a dichotic listening paradigm: listeners are presented with two dynamic scenes, one in each ear. We have here scene X and scene Y, and these can be audio recordings from a variety of sources; we included speech, nature sounds, a cafeteria, and you could be listening to a sporting event in your left ear and an orchestral piece in your right ear. Subjects were asked to listen to this challenging environment and indicate, at every moment in time, which scene is more attention grabbing, using the computer mouse on a screen.
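To make the logic of that report concrete, here is a minimal sketch, not the lab's actual pipeline, of how continuous mouse reports might be pooled across subjects into a salience trace and discrete salient events. The sampling rate, smoothing, threshold, and function names are all illustrative assumptions.

```python
import numpy as np

def salience_trace(traces, fs=10.0, smooth_s=0.5):
    """Pool signed mouse positions across subjects into one salience trace.

    traces : (n_subjects, n_samples) mouse positions in [-1, 1],
             where -1 = fully left scene and +1 = fully right scene
    fs     : assumed sampling rate of the mouse trace, in Hz
    """
    mean_trace = np.nanmean(traces, axis=0)        # agreement across subjects
    win = max(1, int(smooth_s * fs))
    return np.convolve(mean_trace, np.ones(win) / win, mode="same")

def salient_events(trace, fs=10.0, thresh=0.5, min_gap_s=2.0):
    """Mark onsets where the group consistently points at one scene."""
    above = np.abs(trace) > thresh
    onsets = np.flatnonzero(np.diff(above.astype(int)) == 1)
    keep, last = [], -np.inf
    for i in onsets:
        if i - last >= min_gap_s * fs:             # avoid double-counting
            keep.append(i)
            last = i
    return np.asarray(keep) / fs                   # event times in seconds
```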
While this paradigm actively engages listeners in reporting which scene, left versus right, is attention grabbing, it allows us to probe the subject's engagement on a moment-by-moment basis. I will show you later that we can validate these results with a different paradigm, along with physiological recordings, where subjects are not actively engaged with these scenes. The other important note about this paradigm is that by averaging the response to, say, scene X when it is paired against scene Y, scene Z, and so on, we are probing whether there is an event in scene X that was deemed salient or attention grabbing regardless of what was going on in the opposite ear. So we average across many different scenes to define moments, or events, in a reference scene that we call salient or attention grabbing.

Now, if we look at specific events in a scene that were deemed salient, we can ask what changed at that moment that caused an involuntary pull of attention. We performed a first analysis comparing changes in acoustic features from before to after the salient event, examining the range of attributes that we hypothesized would play a role in driving salience perception. An obvious first attribute is the change in loudness, as captured by temporal envelopes, either over the whole signal or over critical bands of the signal. We also looked at a range of spectral and temporal attributes of the signal, either over a time slice before and after the event or across a temporal window.

What we notice is that a wide variety of features seem to correlate with, or drive, salience. What you are looking at here is, on the x-axis, the change in each feature, and on the y-axis, the wide variety of features that we evaluated. You see that a number of features show a statistically significant change near the event. In particular, loudness is obviously a big driver of salience perception. A wide range of spectral features, including bandwidth, spectral brightness, and spectral scale, are also drivers, or correlates, of salience. Harmonicity, or change in harmonicity I should say, is a very stable correlate of salience, along with a number of temporal dynamics, including both slow temporal rates, commensurate with syllabic rates in speech, as well as faster temporal dynamics that tend to correlate with a perception of roughness in sounds.

While we see this broad range of both spectral and temporal features driving salience, it is also important to keep in mind that these features tend to be correlated with each other. So we have to interpret these effects as potentially co-dependent features rather than orthogonal dimensions of salience. It is also worth noting that since we are using natural sounds, those sounds themselves reflect a great deal of constraints on their spectrotemporal structure, which shapes how these interdependencies across features come about. This becomes an important point that complicates our ability to predict and model salience, which I will touch on later in this talk.

So now we know that salience is modulated by not just the absolute levels of these features but their relative levels, and that is an important aspect. For instance, in this analysis we look at overall loudness: what I am showing you here on the y-axis is the overall loudness of a group of high-salience events.
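As an illustration of this before-versus-after analysis, here is a hedged sketch in which loudness is reduced to the RMS of the waveform and brightness to the spectral centroid. The window length and the tiny feature set are simplifying assumptions, not the full battery of features used in the study.

```python
import numpy as np

def rms_loudness(x):
    """Crude stand-in for loudness: RMS energy of the segment."""
    return np.sqrt(np.mean(x ** 2))

def spectral_centroid(x, fs):
    """Crude stand-in for brightness: amplitude-weighted mean frequency."""
    spec = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    return np.sum(freqs * spec) / (np.sum(spec) + 1e-12)

def feature_changes(scene, fs, event_s, win_s=1.0):
    """Change in each feature from just before to just after an event."""
    n, i = int(win_s * fs), int(event_s * fs)
    before, after = scene[i - n:i], scene[i:i + n]
    feats = {
        "loudness": rms_loudness,
        "brightness": lambda seg: spectral_centroid(seg, fs),
    }
    return {name: f(after) - f(before) for name, f in feats.items()}
```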
And what we see is that the loudness changes a great deal just before the subjects indicate that there is an attention-grabbing event. But if you perform the same analysis on a subset of other events, what we would call mid-salience events, that is, events that not everybody found salient, or where the amount of salience was not as consistent across subjects or across opposing scenes, you see that the overall loudness is more or less the same. So a mid-salience event is not necessarily any less loud than a high-salience event; what differs is really the relative change. A sound can be equally loud in two different contexts but may not have the same degree of perceived salience, depending on that context.

The next observation is that salience is shaped by long-term acoustic changes, not just local changes. This is another fascinating observation: in this analysis we look at changes over fairly long time windows. This applies mostly to repetitive events. For example, consider a phone ringing. If we look at the overall loudness, shown here in this red curve, sorry, the blue curve, we are looking at the overall loudness of a phone ring. You see that the first and the second ring have more or less the same overall loudness. Yet when you compare the perceptual salience between the first and the second, you see that there is actually a drop in the perception of salience, which is not surprising. When your phone starts ringing, by the second ring you are slightly less surprised, and probably even less so by the third ring. When we look at the overall effect across a group of events, you see that the salience does drop quite dramatically over a pretty broad range, so acoustic change over long ranges, up to eight seconds based on our analysis, has this kind of masking effect on salience. This masking effect seems to wear off, or the salience strength does recover, after about these eight seconds, and subjects on average are able to refresh their baseline salience and react to sounds again.

Another interesting observation from this behavioral data is that age does seem to play a somewhat interesting role, in that it slows down behavioral responses to salience. This analysis shows reaction times over a pool of 325 subjects. We obviously do not have a full enough age range to make smaller brackets of age effects, so these are age groupings based on a balanced tiering of the data that we have. This is a collection of the subjects we have: some were collected in the lab, and those tend to be more college-age students, and some were collected online, where we did not specifically target an older population. But we do see a pretty significant change in reaction time for subjects who were 55 and older relative to the younger subjects. However, we wanted to see whether this slower reaction affects salience judgments in this 55-and-older group, and that does not seem to be the case. The curve I am showing you here is a measure of intersubject agreement on judgments of salience, measured by this F-score.
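Here is a toy illustration of those two behavioral effects, under loudly labeled assumptions: salience tracks the relative loudness change in context, and a repeated event is suppressed and then recovers with a time constant of roughly eight seconds, as in the data above. The exponential recovery form and all parameter names are mine, not the lab's model.

```python
import numpy as np

def relative_change_db(event_rms, context_rms):
    """An event's loudness relative to its local context, in dB.

    The behavioral result above suggests this relative change, not the
    absolute level, separates high- from mid-salience events.
    """
    return 20.0 * np.log10(event_rms / max(context_rms, 1e-12))

def adapted_salience(event_times_s, base_salience, tau_s=8.0):
    """Suppress events that closely follow a similar event.

    tau_s : assumed recovery constant (~8 s, per the analysis above).
    """
    out, last_t = [], -np.inf
    for t, s in zip(event_times_s, base_salience):
        recovery = 1.0 - np.exp(-(t - last_t) / tau_s)  # 0 just after, -> 1 later
        out.append(s * recovery)
        last_t = t
    return np.array(out)

# A phone ringing every 2 s: the second and third rings are suppressed
# even though their acoustics are identical to the first ring.
print(adapted_salience([0.0, 2.0, 4.0], [1.0, 1.0, 1.0]))
```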
So, how consistent each of these subjects is with the group. What you notice is that there is really no big difference between the older subjects and the younger subjects, except for this much younger group, which tends to be noisier for this particular analysis, though in other analyses you do not really see a huge difference in terms of consistency of salience judgments. So really the main factor here seems to be the slowing of reaction time.

Next, we wanted both to validate the definition of salience obtained with this paradigm and to explore the neural underpinnings of auditory salience in brain processes. We were particularly interested in exploring how the neural activity in response to a sensory stimulus changes with different salience levels. For this next study, we presented subjects with the same natural scenes I mentioned earlier, but instead of a dichotic paradigm, subjects were asked to completely ignore these scenes. Concurrently, while subjects were listening to these nature sounds or orchestral scenes, which they were asked to ignore in the background, we presented them with a tone pattern that they had to pay attention to, detecting when a subtle modulation was introduced, as shown here with this modulated tone. This task was fairly engaging; we had to adjust the difficulty level to keep subjects' engagement quite high, and subjects were reminded throughout the experiment that the background cafeteria, nature, or car-race scenes were irrelevant to the task and should be ignored.

With this paradigm, what we first note is that the background salient events did significantly disrupt task performance. So in this setting we are able to confirm that the background salient scenes are in fact grabbing the subjects' attention in an involuntary fashion. As the subjects performed this task, we collected EEG recordings to explore the neural underpinnings of scene encoding as modulated by salience. Using these data, we quantified what happens to the brain response when a salient event is playing in the background, compared to brain responses away from any event or near a target tone. Because the attended auditory signal here has an inherent regularity, this tone pattern, we can compare the neural phase locking, or the change in phase locking, relative to the attended rhythmic sequence at different moments throughout the trial.

Overall, what we see is that the phase locking to the periodic rhythm is pretty steady throughout the experiment if we exclude the moments where a salient event is in the background or a target is present; you do not really see any big changes throughout the experiment. But when we look at the changes in phase locking near the attended rhythm, basically near these modulated tones, we see a fairly significant increase in the neural response to the attended rhythm near the modulated target. This is commensurate with changes reported previously for voluntary attention, since this is an attended target. In contrast, and what is interesting is this opposite effect: when a background salient event occurs, even though the subjects are ignoring it, or supposedly ignoring it, we see a dramatic drop in phase locking near these salient events.
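For readers who want the measure spelled out, here is a minimal sketch of one standard way to compute phase locking to a rhythmic stream: inter-trial phase coherence at the tone repetition rate. The filter settings, rates, and array layout are illustrative assumptions, not the study's exact analysis.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def phase_locking_value(eeg_trials, fs, rhythm_hz, bw=1.0):
    """Inter-trial phase coherence at the attended rhythm's rate.

    eeg_trials : (n_trials, n_samples) single-channel EEG segments,
                 all aligned to the same point in the tone sequence
    fs         : EEG sampling rate in Hz
    rhythm_hz  : repetition rate of the attended tone pattern
    """
    # Narrow-band filter around the rhythm rate, then extract phase
    lo = (rhythm_hz - bw / 2) / (fs / 2)
    hi = (rhythm_hz + bw / 2) / (fs / 2)
    b, a = butter(4, [lo, hi], btype="band")
    phase = np.angle(hilbert(filtfilt(b, a, eeg_trials, axis=1), axis=1))
    # Length of the mean unit phase vector across trials:
    # ~0 for random phases, ~1 for perfect locking
    return np.abs(np.mean(np.exp(1j * phase), axis=0))
```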
What is even more interesting is that the amount of drop in phase locking is modulated by the degree of salience: the more salient the event, the larger the drop in phase locking. One question here is whether this drop in phase locking coincides with a shift in neural encoding towards the ignored stimulus. If the subjects are being distracted from paying attention to this rhythmic tone, are they encoding the background sound with more fidelity? To do this analysis, we obviously cannot look at phase locking, since the background scenes are dynamic auditory stimuli without any inherent periodicity for which we could measure phase locking. Instead, we used an envelope reconstruction model based on ridge regression. What we see is that when a salient event occurs, there is a statistically significant increase in our ability to decode these background scenes. This is commensurate with the other measures indicating that attention shifts away from the attended rhythmic sequence towards the salient events. So when attention is diverted away from the rhythmic tones, it is diverted towards the salient event, resulting in an improved representation of the scene envelope at that moment.

This push-pull interaction is interesting because it suggests that the brain is drawing on common resources that it dedicates to one or the other of these two stimuli competing for attention, or at least shares across them. A natural question, then, is whether there are shared brain networks engaged during these two forms of attention, voluntary and involuntary. Our analysis does confirm that a shared network is engaged during this task. For this analysis, we wanted to compare the topography of voluntary and involuntary attention across brain voxels, and we adapted a classic technique, canonical correlation analysis, for this purpose. Here we are comparing the brain activity across voxels during the attended targets, so voluntary attention, to the topography of brain activity during the salient background events, so involuntary attention.

Canonical correlation analysis, CCA, is a form of multivariate correlation analysis in which high-dimensional datasets are compared in order to discover interpretable associations, or correlations, represented as data projections. For this analysis, we imposed sparsity constraints on the procedure, which allowed us to improve the interpretability of these projections by confining the maps to constrained vectors, and therefore to identifiable brain regions. That is what the S here stands for: sparse CCA. From this analysis we get a matrix like this: think of it as a cross-correlation between the topography of brain activity during involuntary attention at different time lags and the topography of brain activity during voluntary attention at different time lags on the y-axis.
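As a rough illustration of the sparse CCA step, here is a minimal sketch of a penalized CCA in the style of Witten and colleagues' penalized matrix decomposition: alternating power iterations on the cross-covariance, with soft thresholding to zero out uninformative voxels. The penalty values, initialization, and single-component scope are simplifying assumptions, not the study's implementation.

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink coefficients toward zero; small ones become exactly zero."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_cca(X, Y, lam_x=0.1, lam_y=0.1, n_iter=200, seed=0):
    """One pair of sparse canonical vectors for X and Y.

    X : (n_samples, n_voxels) topographies during involuntary attention
    Y : (n_samples, n_voxels) topographies during voluntary attention
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    C = X.T @ Y                                  # cross-covariance matrix
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(C.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):                      # alternating sparse updates
        u = soft_threshold(C @ v, lam_x)
        u /= np.linalg.norm(u) + 1e-12
        v = soft_threshold(C.T @ u, lam_y)
        v /= np.linalg.norm(v) + 1e-12
    corr = np.corrcoef(X @ u, Y @ v)[0, 1]       # canonical correlation
    return u, v, corr                            # nonzero entries = regions
```

Repeating this for topographies taken at different time lags of each condition would yield a lag-by-lag correlation matrix like the one described next.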
And so what you see in this matrix is a significant correlation between these two topographies that starts at about one second, visible as this off-diagonal activity: about one second here versus 1.5 seconds there. So, as I was saying, salient events, that is, involuntary attention, engage these common brain circuits about half a second prior to their activation by top-down attention. These overlapping brain regions with significant correlation, when we look at them closely, appear to span the inferior and middle frontal gyri as well as the superior parietal lobule.

Finally, I wanted to touch on some of the complicating factors in the study of auditory salience, and ultimately on how we model or develop algorithms that can decide which events in a scene are salient or important, and maybe should be encoded with higher fidelity than other events. We explored how well our current understanding can truly predict whether a sound event is considered salient given its context. We looked into a number of models in the literature that were developed to explore auditory salience. For this analysis, what I am showing you here is an ROC curve, which quantifies correct salience detections by a model: correct detections on the y-axis versus false alarms on the x-axis. This black curve quantifies inter-observer variability, which we are considering here as an upper bound on how well any of these models can perform.

Most of these models, I should mention, basically perform some kind of acoustic analysis, differing in the level of detail and representation of the features, followed by some kind of integration across those features. One of the earliest models we looked at is a model by Kayser et al., which adapted the center-surround idea from vision to an auditory salience model. The plot here shows the ROC curve obtained by this Kayser et al. model, which, as I said, quantifies correct detections versus false alarms. What we see is that the model is rather limited in its ability to predict the presence of salient events in these dynamic scenes, in the paradigm I discussed in today's talk, largely because it treats the auditory spectrogram as an image and performs a vision-like analysis on the time-frequency image, mostly ignoring the temporal structure of sounds as they evolve in time and giving no special treatment to the spectral dimension.

Other models that we explored from the literature do improve the prediction capability; we see two such models here. One of their main limitations is really a linear treatment of the acoustic dimensions: they ignore the co-dependencies across the contributions of the different acoustic dimensions. So in other studies we considered both a predictive coding model and a model with nonlinear interactions, basically allowing these features to have covariations or co-dependencies between them.
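For concreteness, here is a hedged sketch of how any of these models can be scored against the behavioral events with an ROC analysis: sweep a detection threshold over the model's continuous salience output and tally correct detections versus false alarms. The tolerance window and threshold grid are illustrative assumptions.

```python
import numpy as np

def roc_points(model_trace, fs, event_times_s, tol_s=1.0, n_thresh=50):
    """Hit rate vs. false-alarm rate across detection thresholds.

    model_trace   : model's salience score per time sample
    event_times_s : behaviorally defined salient-event onsets, in seconds
    tol_s         : assumed tolerance for counting a detection as correct
    """
    t = np.arange(len(model_trace)) / fs
    near = np.zeros(len(t), dtype=bool)          # samples near any event
    for ev in event_times_s:
        near |= np.abs(t - ev) <= tol_s
    points = []
    for thresh in np.linspace(model_trace.min(), model_trace.max(), n_thresh):
        detected = model_trace >= thresh
        hits = detected[near].mean()             # correct detections
        fas = detected[~near].mean()             # false alarms
        points.append((fas, hits))
    return np.array(points)                      # area under curve = AUC
```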
I am not getting into the details of what these acoustic features are; there is variability across these different models in the literature, and even across some of the models that we developed here. But at a high level, the analysis performs some sort of comparison or integration across these different features, either allowing nonlinear interactions or not, and for the better models that give us this improvement, the nonlinear interaction is an important ingredient that allows us to achieve better predictions of auditory salience. These two greenish lines are those two models: one is based on predictive coding, which allows a tracking over time of the dynamics of the scene, while the second is a nonlinear interaction model that integrates nonlinearly across features. Both of these models yield improved predictions in our ability to flag what is salient in a scene. The main takeaway, the take-home message, from these approaches is that auditory salience is really far more nuanced than just a linear combination of acoustic attributes. And there is still a fairly large gap in how well any of these models perform.

In more recent work, we explored how much we can improve predictions with a convolutional neural network that learns the co-dependencies between acoustic dimensions in a dynamic, everyday soundscape. This model is not very deep, because we do not have a large amount of data for which we have salience judgments. However, what you will note is that the ROC curve for this approach, this orange curve, does show improvements over the previous models I mentioned. The contribution of this convolutional approach really lies in the way it integrates information across these different dimensions; it gives a more detailed mapping of these acoustic features onto a salience space. Nonetheless, while we see a small improvement, we are still fairly far from this upper bound for predicting human judgments. The model is likely learning more complex nonlinear interactions between acoustic features than any of the other models, but it still remains a purely acoustic model.

Finally, we were curious to test the hypothesis that what drives attention is far more than just the acoustic structure of the sound, and also incorporates some semantic interpretation of the scene itself. As I have been saying, the sound of a fork is probably far more surprising in the middle of a classroom than it would be in a restaurant. Taking this context into account is probably an important aspect that needs to be incorporated into these models. So what we did is extend this salience model by including, in addition to the acoustic attributes, what we map as a semantic vector, which basically reflects the output of another neural network, trained not on salience but on identifying the scene or the events in a scene. This is a large convolutional neural network trained on fairly large amounts of data. What this model gives us, complementing the acoustic analysis that we perform, is a semantic or contextual interpretation of the scene or the environment we are in.
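Schematically, the fusion might look like the sketch below: hand-crafted acoustic features concatenated with a semantic embedding from a pretrained scene classifier, feeding a small network that outputs a per-frame salience score. The layer sizes, names, and the choice of PyTorch are my assumptions for illustration, not the lab's architecture.

```python
import torch
import torch.nn as nn

class AcousticSemanticSalience(nn.Module):
    """Toy fusion model: acoustic features + semantic scene embedding."""

    def __init__(self, n_acoustic=64, n_semantic=128, hidden=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(n_acoustic + n_semantic, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),          # salience score per time frame
        )

    def forward(self, acoustic, semantic):
        # acoustic : (batch, n_acoustic) e.g. loudness, brightness, ...
        # semantic : (batch, n_semantic) embedding from a pretrained
        #            scene/event recognizer, held fixed during training
        x = torch.cat([acoustic, semantic], dim=-1)
        return self.head(x).squeeze(-1)

model = AcousticSemanticSalience()
scores = model(torch.randn(8, 64), torch.randn(8, 128))  # (8,) salience scores
```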
With this, we are able to achieve far larger improvements, as you see here in this red curve, in our ability to predict which events are salient and which are not. Though obviously we still have not closed the gap with human perception, suggesting that there are a number of elements of auditory salience that remain to be explored.

So, overall, when we think about what we know about auditory salience: it is obviously a multidimensional, nonlinear, and dynamic process; salience continuously modulates the neural encoding of auditory stimuli, providing a tight interaction between sensory representations and cognitive feedback; and models of auditory salience do need to consider the complex interplay between low-level and high-level representations of auditory information. There remain a number of open questions to be explored: the integration across multiple timescales; the role of familiarity, semantics, and prior knowledge; and the fact that salience is not a one-to-one mapping, since context can change our interpretation of a sound. A number of other challenges remain as well. The behavioral measures of salience remain indirect, so they are not really giving us a direct measure of how our attention is being modulated by different events around us; one of the questions that remains is really measuring attention versus detectability, and whether there is a big difference between the two. There are really no common datasets that would allow us to validate across these models, and these models by and large remain fairly static and are not really themselves modulated by attention.

So there are a lot of questions that remain to be explored in this space. With that, I will just acknowledge the work of all the current and former students and postdocs in the lab, who have done pretty much all of this work, as well as some of the funding agencies, and I thank you for your attention.