Okay, so I'm here today to talk to you about some work I've been doing on linguistic mediation of visual saliency, and a unique analysis that is able to incorporate large data sets and infer differences across different language contexts. Before I begin, I'd like to acknowledge my collaborators: Jeffrey Viat, who graduated with his master's in experimental psychology, and my lab, which is called the LAB Lab, or Language and Behavior Lab. I have several undergraduate assistants who have worked on this project. And thank you again for inviting me.

So we use language in a lot of different contexts. We use it in various social situations, like when we're talking to undergraduates in the lab and telling them how to analyze data, as this picture clearly shows. We also use it in very complex, dynamic visual environments that change as we move around the world, move our eyes, and explore our surroundings. And we use it in various auditory contexts, whether that's a noisy room you're speaking in, where you have to compensate for an impoverished speech signal, or contexts like hearing loss or other speech processing disorders. So my work mainly focuses on this question: what are the principles and mechanisms of comprehending language in realistic contexts? There are a lot of different sources of information that we use in language processing; it's sensitive to many kinds of context. But we don't have a very good understanding of how exactly we know to draw information from these different sources and incorporate them during language processing. If we did, we would be able to have conversations with computers that included things like sarcasm and jokes and the things we naturally do with language every day.

So one way to think about context is where the source of information is coming from. Where it comes from is going to affect how it's incorporated and processed during speech comprehension. First, there is visual information that we incorporate into the context of sentences. For example, if I were to just wander up to you and say, "The eagle is not in the sky," and walk away, that makes no sense. There's no context for it, right? By itself, it's underspecified. But if we were out bird-watching and I point to a tree and say, "The eagle's not in the sky," suddenly it's a perfectly reasonable sentence. So your visual context continually shapes whether or not you'll be able to comprehend a sentence. I think this is most obvious when you're trying to debug a friend's computer over the phone and it takes two hours, when, if you were right there with them sharing the same visual environment, it would take 20 minutes or so. So visual information is very easy to use; having a shared visual environment is very powerful during language processing.

The other thing about language is that it serves as a context for itself. Take this classic sentence; it's a garden path sentence where you initially have some ambiguity and then resolve it by the end of the sentence: "The horse raced past the barn fell." How many of you are familiar with the sentence, first of all? Okay, just a couple of you. How many of you understand the sentence? Okay, only one person. So now let me give you a paragraph and see if you can parse the sentence correctly. Suppose I say: Earlier this month, the horses were in training for an important race. To test out different training techniques, one of the horses was raced past a barn in a quiet corner of a farm,
while the other was raced past a noisy schoolyard. They did this every day for a week. In the end, when it came to the big race, the horse raced past the barn fell. Does it make sense now? Yeah, so I've added linguistic context. Now you know that there are two horses, I'm clarifying which horse was being raced, and that that one fell over. So it's the horse raced past the barn that fell. What was that? Yeah, so this is one of the most difficult garden path sentences, and I hate it in a lot of ways, because people almost never understand it even by the end of the sentence. A better one is something like "The man who whistles tunes pianos." That one has a temporary ambiguity, but you resolve it by the end. But this one, I think, illustrates well that when you add more linguistic context, it becomes unambiguous.

Okay, so we also have social context. For example, if I'm talking to a guy with a really nice beard and I say, "What a lovely beard you have," that is a perfectly acceptable thing to say. However, if I am talking to a woman and I say, "What a lovely beard you have," suddenly the meaning is very different. So your social context also continually plays a role and shapes meaning. The last type of context I'm going to talk about here is long-term memory. This one is very difficult; it's more effortful to incorporate while you're having a conversation with someone. So if I ask you what kind of eagle we saw in the sky last week, you have to search your prior memories, maybe visualize what eagle you saw, and then turn that into linguistic information. It's very hard for people sometimes to access prior memories; we don't like to do it unless we absolutely have to. In more recent research I've looked at how, when you give people information, even if it's completely false, they will rely on it as opposed to their prior memories. They rely on the more recent information. So you can tell people the layer of fat around a whale is called "flubber," and even if they know it's called blubber, they will reply, "I believe it is called flubber." They don't take the current information they're getting and try to reconcile it with prior memories; they just go with the most recent information they've encountered.

Okay, so in my lab it's all about context. The goal is to begin describing, at a more computational and algorithmic level, how we do this kind of language processing. It's very fluid, it's dynamic, it incorporates all these different sources of information that we have to switch between, and eventually this could lead to systems that can use things like sarcasm and empathy, and use behavioral and visual cues that lead to richer understanding. Our current understanding of speech perception at very low levels is quite good; we have very good models of language processing up until about sentence processing. But when you get to this pragmatic level, it starts to be very complex to describe how exactly you do this, because it's so interactive.

So, some examples of other work I've done. I've looked at how we process language when you have two visual referents and you have to build meaning in the moment. It's a different kind of mechanism than when you're doing something like recognizing a familiar object. So if I say "the towel is not on the rack" and people have to look at a towel on a rack and a towel on the floor, they have to build that meaning in the moment. They have to accumulate the information in front of them and then form a response.
So they don't come into the lab with representations in their minds that they're accessing and fighting between, which would be more of a competition mechanism; they have to build it in the moment. What we've been finding is that language processing at this level has very different characteristics than it does at the lower levels of language processing. I've also looked at different behavioral contexts, where you take away the task and just have people listen to different kinds of language while looking at a blank screen, to see how it affects their behavior. And, as I mentioned, there's the research on prior knowledge, looking at how people process information when it conflicts with their prior memories.

So largely what I do in my lab is track behavior in real time. We use a Tobii eye tracker, which just sits at the bottom of a computer screen and picks up on people's pupils and how far away from the screen they are. It records their eye location at 60 Hz, about every 17 milliseconds. Typically in these language experiments, people are viewing images, listening to speech over headphones, and clicking on a picture on the screen. So it's a very specific, constrained kind of task. And the reason this works is that eye movements are very closely time-locked to speech. Within about 100 milliseconds of the onset of a word, people will start to look at a picture that sounds like that word. So from the moment you start to say the word "bear," people will start to look at anything on the screen that starts with a B. It works very, very well as a window into processing, for seeing what's going on in the mind as it's happening.

So you might see something that looks kind of like this, where you start out looking at the center of the screen and you hear "click on the pear," or, I'm sorry, "click on the peach." And you have the peach and you have the beach, which is just one phoneme different from peach. So you might look at the peach, look at the beach, and then click on the peach. It's a very simple task. There are two unrelated items on the screen; you might not look at them much. What's important here is that anything that sounds similar, anything that is semantically or visually similar, will compete for your attention on the screen. And you can see how much it competes for your attention with eye tracking.

So one of the classic studies in this area shows how the visual environment shapes your sentence processing. In the picture on the left you have the apple on the towel, a pencil, an empty towel, and a box. And when you say "put the apple on the towel in the box," people might look at that apple, then look at the empty towel, and then look at the box, because when they hear "put the apple on the towel," that could be the end of the sentence right there. So they end up looking at the incorrect destination before they look at the correct destination. Now with the one on the right, you have two referents; you have two apples. So when people here hear "put the apple on the towel in the box," they know that "on the towel" is disambiguating information: it tells you which apple you're supposed to put into the box. So you look at that apple, you might look at the other apple, you might not, but typically you do not look at that incorrect destination, that empty towel.
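As an aside for anyone reading this transcript: the time-locking between speech and fixations is usually quantified by binning gaze samples relative to word onset. Here is a minimal Python sketch of that idea; the data layout and the function name looks_by_bin are hypothetical, not the lab's actual pipeline.

```python
def looks_by_bin(roi_per_sample, onset_sample, rate_hz=60, bin_ms=50, window_ms=1000):
    """Proportion of gaze samples on each region of interest, time-locked to word onset.

    roi_per_sample: one label per 60 Hz gaze sample for a single trial,
                    e.g. 'target', 'competitor', or 'other'.
    onset_sample:   index of the sample at which the spoken word began.
    """
    per_bin = max(1, round(bin_ms / 1000 * rate_hz))
    props = []
    for b in range(window_ms // bin_ms):
        chunk = roi_per_sample[onset_sample + b * per_bin:
                               onset_sample + (b + 1) * per_bin]
        if not chunk:
            props.append({'target': float('nan'), 'competitor': float('nan')})
        else:
            props.append({'target': sum(r == 'target' for r in chunk) / len(chunk),
                          'competitor': sum(r == 'competitor' for r in chunk) / len(chunk)})
    return props

# Hypothetical trial: brief looks to the competitor ("beach") soon after word
# onset, then a switch to the target ("peach").
trial = ['other'] * 12 + ['competitor'] * 10 + ['target'] * 40
print([p['target'] for p in looks_by_bin(trial, onset_sample=6)])
```

Each entry is the proportion of looks in a 50 ms bin; in real experiments of this kind, such curves are aggregated over trials and participants.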
So this study has led to about 20 or 30 years of research in this style: changing the visual referents and changing the speech to see how those interact, and really digging into minute details of language processing. But I've always been curious about how well this generalizes. Typically we're not sitting around on a beach with a peach and just asking someone to click on the peach; we're having real conversations in real situations, using humor, using language to communicate with one another.

So in a very recent study, researchers asked what happens when you're processing language and the visual referents go away, so that instead of simply looking around the screen, you have to use your memory to search and decide whether the image is there. What they did is have three pictures on the screen, and you hear "banana" and have to say whether the banana is there or not. All you say is present or absent; in the target-present trials the image of the banana would actually be on the screen. I'm only going to talk about the target-absent trials, though. So with "banana," you have something that is visually similar, the canoe; something semantically similar, the monkey; and then an unrelated item, like a hat. Now, when the pictures are right there in front of you, what people tend to look at most is the visual competitor, the canoe. They look a little bit less, but still look, at the semantically similar picture, the monkey, and not so much at the hat. Now when you change this slightly, so the images are there and then taken away before you ask people "is there a banana?", well, what previous research has shown is that if you say "banana" and the banana was on the screen, people will look to the region where the banana was. The question is, do they also look at the related items when those have disappeared? And they don't. So when those visual referents go away, even if they were just there on the screen, suddenly there's no competition. Just that tiny change, the visual referents being there or not being there, and suddenly this mechanism that we thought was very robust, that was always there, is not there anymore. So this might not be as generalizable as we thought it would be.

So what I wanted to do with the research I'm going to talk about primarily today is see what happens when you don't have a specific task, just like when you're sitting around having a conversation with someone: you don't have a very specific goal like clicking on a peach, and you have speech that's very loosely coupled to the visual environment. For example, every once in a while I'm using these slides as a reference for my speech, but I'm also saying things that are not on the screen. Regular conversation has this loose coupling to the visual environment.

So in order to do this, I'm going to look at visual saliency. Visual saliency is a quantitative measure of approximately where you will look in a particular static image. I have this picture here of the different colored houses. A visual saliency algorithm will take things like orientation, contrast, and color, and predict the areas that people are likely to look at. You can then collect data from people with eye tracking, measure those regions, see where they look, and see how it aligns with your predictions. In this particular one, they tend to look at this green door, they look a little bit over here, and a little bit at the silhouette of the woman.
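Before turning to the map itself, here is a deliberately crude Python sketch of the kind of computation a saliency algorithm performs: center-surround contrast on image intensity plus a center bias. This is a toy stand-in for illustration only; the model actually used in this work is GBVS, which combines color, intensity, and orientation channels in a graph-based way.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def toy_saliency(gray_image):
    """Very rough saliency stand-in: center-surround contrast plus a center bias.

    gray_image: 2-D array of intensities in [0, 1].
    Returns a map in [0, 1]; bright = predicted salient (the white regions on the
    slides), dark = predicted non-salient.
    """
    img = gray_image.astype(float)
    # Center-surround contrast: difference between a fine blur and a coarse blur
    contrast = np.abs(gaussian_filter(img, sigma=2) - gaussian_filter(img, sigma=12))
    # Center bias: viewers tend to look near the middle of photographs
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    center = np.exp(-(((yy - h / 2) / (0.4 * h)) ** 2 + ((xx - w / 2) / (0.4 * w)) ** 2))
    sal = contrast * center
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)

# Example with a random "photo" stand-in
sal_map = toy_saliency(np.random.rand(240, 320))
```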
When you look at the visual saliency map of this image, it looks something like this. The white areas are the very salient areas and the black areas are the non-salient ones. And you notice there's always this center bias; people tend to look at the center of images. You can almost see the image there. There's a lot of contrast, there are a lot of lines, and so there are many different regions of salience.

So in this experiment we had images like this, where there are lots of different things to look at, like these Bond icons. These are icons for all of the different Bond movies over the years; we have Moonraker right here. So there's a lot of stuff to look at, a lot of things to think about, and in the saliency map of this image there is a lot of contrast; you can almost see the image. So again, there's very high resolution, with lots of different alternatives on the screen to look at. Other images we used are what I would call low entropy: images with a few regions that are predicted to be salient and very large regions that are predicted not to be salient. So you have things like this; this is the saliency map of the image of the parents.

So what we did is not just have people view these images but introduce language. There were two conditions, one in which the language was presented before the image and one during. Do you have a question? I will talk about the particular algorithm that I used in just a minute. The language consisted of little paragraphs that somewhat related to these images. Half of them were in an affirmative form and the other half were in a negated form. Now, I'm going to spoil it right now and tell you the negation has no effect on anything. What's going to be interesting here is whether the speech is presented before the image or concurrent with it. This is an example of one of the vignettes that was played: "Will decided not to paint his house a bright color. He does not prefer a building that is painted vibrantly. Will did not often visit brightly colored homes. Sometimes Will would not want to stay at home." So it's not very interesting, but it's language, and they were all sentences that we could negate in the same way.

So in order to analyze this data, what we needed was something that could compare how predictive a model of visual saliency is for the language-during and the language-before conditions. What we found is an analysis called receiver operating characteristic (ROC) analysis. How many of you are familiar with receiver operating characteristic analysis? Excellent, a few of you. For those of you who are not familiar with it, this is an analysis that was developed during World War II to detect enemy objects on a battlefield. You take radar data, and in radar data you want to be able to tell whether a bomb is coming at you or it's just a cloud. Being able to detect a particular pattern in very noisy data on a single trial is incredibly important; in this case it might be life or death. So they developed an analysis that parses the signal from the noise. All of the graphs that you'll see are graphs with what's called a line of no discrimination. If your curve sits on that diagonal line, the model is not predictive at all; it's at chance. If the curve goes above the line, the algorithm is predicting the data, and if it goes below, it's doing no better (in fact worse) than chance. This was also used in the 1950s in psychophysics to measure the detection of very weak signals.
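For those seeing ROC analysis for the first time, here is a minimal, generic Python sketch of the signal-detection version of the idea: scores from "signal" trials and "noise" trials, a criterion swept across them, and hit rate plotted against false-alarm rate. The variable names and toy data are mine, just to show the mechanics.

```python
import numpy as np

def roc_curve(signal_scores, noise_scores, n_thresholds=101):
    """Sweep a criterion over the scores; return (false-alarm rates, hit rates)."""
    signal_scores = np.asarray(signal_scores, float)
    noise_scores = np.asarray(noise_scores, float)
    thresholds = np.linspace(min(signal_scores.min(), noise_scores.min()),
                             max(signal_scores.max(), noise_scores.max()),
                             n_thresholds)
    hits = np.array([(signal_scores >= t).mean() for t in thresholds])
    false_alarms = np.array([(noise_scores >= t).mean() for t in thresholds])
    return false_alarms, hits

def auc(false_alarms, hits):
    """Area under the ROC curve via the trapezoid rule (0.5 = chance, 1.0 = perfect)."""
    order = np.argsort(false_alarms)
    return np.trapz(hits[order], false_alarms[order])

# Toy example: radar-style detection with overlapping signal and noise distributions
rng = np.random.default_rng(0)
fa, h = roc_curve(rng.normal(1.0, 1.0, 500), rng.normal(0.0, 1.0, 500))
print(round(auc(fa, h), 2))   # roughly 0.76 for a one-standard-deviation separation
```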
ROC analysis has also been used more recently with multi-dimensional data like eye movements, though only a couple of studies have used it that way. It's been used with fMRI data to look at individual subjects. If any of you are familiar with fMRI data, it is very noisy; normally you have to aggregate over multiple subjects or else it's all noise, but with this analysis you can look at individuals and get reliable data. It's also used with EEG and with pretty much any large data set, and it's often used in the evaluation of visual saliency algorithms. There are several different variations of visual saliency, different ways of parsing a picture to determine which areas are salient versus non-salient, and the higher the curve, the more predictive and the better your algorithm is. So it's very common in that area.

The only limitation here is that you need some kind of baseline, a prediction of what your data is going to look like. With that, you then put your data on top of it and see how much of it matches, essentially. The area under the curve tells you how much of your data matches that baseline. If you have a very low area under the curve, less than 0.5, it means you either don't have a good baseline or the data you've collected is not aligned with that baseline. But if it's higher than 0.5, it means that you have both a good baseline and data that matches it.

Okay, so again, for the fixation data that I used, participants either had the audio first or the audio concurrent. Participants with lots of missing samples were discarded. There's no task here: they were just asked to sit and listen to these stories, and they viewed each image for 15 seconds, so that we have approximately the same amount of data for each trial and each image. The question is how well purely visual salience, just the visual characteristics, predicts viewing behavior in these different language conditions. It could be that people ignore the language altogether and just move their eyes based on the visual characteristics, or these different language conditions might be mediating people's eye movements, changing the behavior a little bit, so that visual salience becomes less predictive.

Okay, so a little bit of detail about the visual saliency algorithm I used. It's called graph-based visual saliency (GBVS), a somewhat more recent model of visual saliency. It extracts feature vectors from the image, various features like the color and orientation I mentioned, forms an activation map over those feature vectors that is continuous between zero and one, and then normalizes it. So it gives non-binary values between zero and one, but for the ROC analysis it needs to be binary: either the eyes are there or they are not; either a region is salient or it's not. So what we did is apply a binary mask, sampled at different rates in increments of five percent. What we get is an image that has lots of regions predicted to be salient, or very, very few regions predicted to be salient; I'll show you that in a second. Then we classify each fixation as either a hit, meaning it lands in one of those salient regions; a miss, meaning it lands in a non-salient region; or off-screen entirely, because sometimes they were looking at the wall. These are undergraduates, so they look at all kinds of things. And we do this for each individual trial.
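Here is a sketch, under my reading of the procedure just described, of how a single trial's fixations might be scored against a saliency map thresholded in 5 percent steps: at each level, only the top X percent most salient pixels count as "salient," the proportion of fixations landing inside that mask is the hit rate, and those points trace out the trial's ROC-style curve. The function name fixation_roc and the data layout are hypothetical, not the MATLAB code mentioned at the end of the talk.

```python
import numpy as np

def fixation_roc(sal_map, fixations, step=0.05):
    """Hit rate of fixations as the salient mask grows from 0% to 100% of the image.

    sal_map:   2-D saliency map, higher = more salient (e.g. a GBVS-style output).
    fixations: list of (row, col) fixation coordinates for one trial; fixations
               off the image (looking at the wall) are simply dropped.
    Returns (fraction_of_image_salient, hit_rate); the area under this curve is
    the trial's AUC, with 0.5 meaning the map predicts no better than chance.
    """
    h, w = sal_map.shape
    on_screen = [(r, c) for r, c in fixations if 0 <= r < h and 0 <= c < w]
    fix_vals = np.array([sal_map[r, c] for r, c in on_screen])
    levels = np.linspace(0.0, 1.0, int(round(1.0 / step)) + 1)   # 0%, 5%, ..., 100%
    hit_rates = []
    for frac in levels:
        if frac == 0:
            hit_rates.append(0.0)
            continue
        cutoff = np.quantile(sal_map, 1.0 - frac)   # top `frac` of pixels are "salient"
        hit_rates.append((fix_vals >= cutoff).mean())
    return levels, np.array(hit_rates)

# Toy trial: fixations drawn toward the bright corner of a synthetic map
rng = np.random.default_rng(1)
toy_map = np.add.outer(np.linspace(0, 1, 200), np.linspace(0, 1, 200))
fix = [(rng.integers(120, 200), rng.integers(120, 200)) for _ in range(30)]
levels, hits = fixation_roc(toy_map, fix)
print(round(np.trapz(hits, levels), 2))   # well above 0.5 for these fixations
```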
It's kind of amazing to me that this works, because eye tracking data is very messy, and yet on each individual trial you can actually see the ROC curves: people are looking at salient regions. Okay, so this is what the graph-based visual saliency map might look like originally, before the binary thresholding. It's a continuous value between zero and one, where the white regions again are the more salient ones and the black is predicted to be non-salient; that is the saliency map of this image. Now when you apply the binary mask it turns into this, and this is at the 85 percent sampling rate, so not all of the regions that could be salient are represented here; it's a little bit less. And back to this image, the one that's a little more high entropy, with lots of contrast: when you start to downsample, it starts to lose some of those points, and because it is biased again toward the center, you start to lose a lot of those edges. Of course, when it's at zero percent saliency, none of the fixations are going to hit salient regions, because none of the image is predicted to be salient. As the mask grows larger, more and more fixations are going to land in salient regions, and that's where we get that ROC curve.

Okay, so qualitatively in this study, when we first looked at the data, we went, this is amazing: it looks like the affirmative language and the negated language are doing different things. Our prediction was that when people hear the negated language, they would look for alternatives. They would look around and try to see what else is around that you could be referring to; since you're saying "not the green door," they might look at the blue door or the yellow door, and that's what it looks like they're doing. But when you do the ROC analysis, the lines are right on top of one another. This is actually two lines, for the affirmative and the negated, and they're right on top of each other, so there was no difference at all in how the language drove eye movements to salient regions or not. Now, this is the language at the same time, the audio concurrent: they're looking at the image while hearing the speech. The overall model is significant; it does fairly well at predicting people's eye movements, but again, no difference between affirmative and negated. Now, when you play the audio first and then they look at the image, it changes to this. Again, the two lines are right on top of one another and the negation makes no difference, but the model is a lot more predictive. So visual salience is driving people's eye movement behavior more than when the language comes at the same time. In other words, when the language is concurrent, it is exerting an influence: it is changing and mediating people's eye movements in the moment and doing something that purely visual salience cannot predict. And a comparison between these two experiments shows they are significantly different from one another; the audio-first condition is better classified than the audio-concurrent condition.
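The talk doesn't say which statistical test is behind "significantly different," so purely as one plausible way to compare the two conditions, here is a hedged sketch of a bootstrap on per-trial AUC values (for example, the areas under curves like the one from the fixation_roc sketch above). The numbers below are invented to illustrate the direction of the reported effect; they are not the study's data.

```python
import numpy as np

def bootstrap_auc_difference(aucs_first, aucs_concurrent, n_boot=10000, seed=0):
    """Bootstrap CI for the difference in mean per-trial AUC between two conditions.

    aucs_first, aucs_concurrent: arrays of per-trial AUC values.
    Returns (observed difference, 95% CI lower bound, 95% CI upper bound).
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(aucs_first, float)
    b = np.asarray(aucs_concurrent, float)
    observed = a.mean() - b.mean()
    diffs = [rng.choice(a, a.size).mean() - rng.choice(b, b.size).mean()
             for _ in range(n_boot)]
    low, high = np.percentile(diffs, [2.5, 97.5])
    return observed, low, high

# Invented per-trial AUCs in the same direction as the reported result
# (audio-first more predictive than audio-concurrent); not real data.
rng = np.random.default_rng(2)
first = rng.normal(0.80, 0.05, 60)
concurrent = rng.normal(0.72, 0.05, 60)
print(bootstrap_auc_difference(first, concurrent))
```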
So, I like to play with data, and one of the things I was curious about is how robust this analysis is in the face of individual differences. If any of you have worked with large data sets from people, you know people tend to be very different, especially in eye movements; eye movement behavior is completely different from person to person. So we might expect that this would be all over the place for different subjects. This is what the image variability looks like, each individual image in the audio-concurrent condition only. There's a good amount of variability; there was a huge range of images, and for some of them the algorithm was very good, for some a little less good. This is the subject variability. This is almost unheard of; I don't know of another analysis that is so robust in the face of individual differences. No matter how different your individual subjects are, this seems to pick up on trends that are driven more by the image characteristics than by the individual. In the audio-first condition, the algorithm was of course better overall at predicting eye movements, since the speech came before the picture, so there's not quite as much variability because we're pushed toward the ceiling here, and the subject variability again is very small. So there are not a lot of differences between participants when you look at the data this way.

So this is a really sensitive analysis. It's really robust in the face of subject variability, which is typically quite large in eye tracking or any kind of noisy data set. And what's really cool is that you can actually test significance on individual trials; one way that might be done is sketched below. I did not do that here, I collapsed over many, many trials from different people, but you can. So it's a very sensitive measurement, and if you wanted, you could test a much smaller data set.

So, what appeared to be differences in looking behavior, what looked like differences between affirmative and negated, didn't make a difference in this analysis; looking was still guided by the salience of the image. The regions people looked at, even if they were looking at more diverse regions of the screen, were still salient, so the behavior seems to be driven by that characteristic.

One last thing I'd like to tell you about, one last analysis we did, is to look at these very high entropy images versus the ones with lower entropy, where there are these large areas of salience. If you hear something like "she did not see the 007 logo," you've got a nice 007 logo right here and a bunch of other things to look at. If I say "he did not look at the parents," what else do they have to look at? Not a whole lot. So there are no alternatives available when they hear that negation. What I looked at is whether the visual image had alternative referents available or not, and there is a difference. In the audio-concurrent condition only, when they have the language and the image at the same time, there is a small difference between the images that have alternatives available and the ones that do not: eye movements are more driven by the salience when those alternatives are there, but only in a very small way.

So, this analysis does an excellent job with multi-dimensional data sets that have some kind of baseline; that is all the qualification you need for it. You can do this with n-dimensional data, not just two-dimensional eye tracking, any number of dimensions you want. GBVS is a pretty good model of visual saliency; it's fairly good at predicting eye movement behavior under different conditions.
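Coming back to the earlier point that significance can be tested on individual trials: one common approach in the saliency literature is a permutation baseline, in which the same number of fixations is placed at random locations many times and the observed trial-level AUC is compared against that distribution. Here is a self-contained Python sketch; the rank-based fixation_auc and the toy data are illustrative assumptions, not the analysis actually run in this study.

```python
import numpy as np

def fixation_auc(sal_map, fixations, n_random=2000, seed=0):
    """Rank-based AUC: probability that a fixated pixel is more salient than a random pixel."""
    rng = np.random.default_rng(seed)
    fix_vals = np.array([sal_map[r, c] for r, c in fixations], float)
    rand_vals = rng.choice(sal_map.ravel(), n_random)
    greater = (fix_vals[:, None] > rand_vals[None, :]).mean()
    ties = (fix_vals[:, None] == rand_vals[None, :]).mean()
    return greater + 0.5 * ties

def permutation_p(sal_map, fixations, n_perm=200, seed=1):
    """One-sided p-value: how often randomly placed fixations score as high as the real ones."""
    rng = np.random.default_rng(seed)
    h, w = sal_map.shape
    observed = fixation_auc(sal_map, fixations)
    count = 0
    for _ in range(n_perm):
        random_fix = list(zip(rng.integers(0, h, len(fixations)),
                              rng.integers(0, w, len(fixations))))
        if fixation_auc(sal_map, random_fix) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

# Toy single trial: fixations biased toward the brighter corner of a gradient map
rng = np.random.default_rng(2)
toy_map = np.add.outer(np.linspace(0, 1, 200), np.linspace(0, 1, 200))
fix = list(zip(rng.integers(100, 200, 25), rng.integers(100, 200, 25)))
print(permutation_p(toy_map, fix))   # small p: this trial's looks track the map
```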
This could be done, of course, with any model of visual saliency or any kind of predictive algorithm you might have. And people might be looking around a little bit more with negated language, looking around for different referents, but what they're looking at, what they're drawn to, are still regions that are salient. So there are visual characteristics that are still guiding that behavior.

So when is language, what's language really doing here, when is it changing eye movements? What I think is going on is that it's mainly the task. When the only task is to listen, behavior is a lot different than when you have to click on the peach or do some other very constrained task. In fact, the very first eye tracking studies showed this: they showed people images and asked them questions like "how affluent do you think this family is?" versus "what kind of event are these people preparing for?", and people view images differently depending on the question you give them. This is the title of a paper that I thought described this very well: "Saliency does not account for eye movements during visual search." Visual search is a very particular task, where you're asked to look for something amongst lots of distractors, and the way you do that is very different from other kinds of looking behavior; salience does not seem to drive eye movements in that case. And semantically and visually based competitors draw fixations during word recognition in these visual world paradigms, the "click on the peach" tasks, but looks to the blank regions where those competitors were are no different from looks to the unrelated regions. So, back to the study I talked about with the banana, the monkey, and the canoe: if they aren't there, there might not be that competition.

So, potential applications. I know a lot of you do lots of different kinds of research, but I feel that ROC analysis is a nice, sensitive analysis that could be applied in lots of different areas, and I don't think it's being used as much as it could be. Of course in psycholinguistics, with things like eye tracking; even motion tracking and models that predict body movements could use it, and it could be applied in robotics. There are plenty of predictive models in economics, politics, and machine learning, and it could be used for all of those. You could use it to compare different analysis techniques in fMRI and EEG data, and even to look at something like experts versus novices, say expert dancers versus novice dancers, and compare their performance. So those are some of my ideas, and I do have code available for this if it's something you're interested in. It's something that I programmed in MATLAB, so if you are a MATLAB user, just send me an email and I can send you the code. And thank you for letting me mediate your eyes. Any questions?