Thank you for giving me the opportunity to spend some time here. Even though this is a short visit, just 10 days, I'm very glad to be able to interact with all of you and to give this talk, and thanks for the introduction. So yes, I'm going to be talking about language in humans and machines.

Basically, what I'm interested in is language: natural language, the language that we use to communicate with each other. I think this is a fascinating ability that we have, a fascinating skill that allows us to exchange ideas, to socialize, and also to coordinate in achieving things together through communication. My field of research is computational linguistics; you may also be familiar with the term natural language processing, which has become very popular. The two are almost interchangeable, but I do have a preference for the term computational linguistics because I think it makes very explicit the interdisciplinarity of this field, the fact that different fields come together in it. The things I'm interested in have to do with understanding better how human language works, which has to do with linguistics and cognitive science, by using computational methods that most of the time come from machine learning, from AI, from computer science, and so on. So oftentimes I think about how ideas from linguistics and psycholinguistics can inform the development of NLP models, natural language processing models. These are the kinds of questions I'm interested in, and they will be part of what I talk about today.

In the last few years, as you may know, the field of NLP and computational linguistics has been revolutionized by many ideas, particularly in the last year. Actually, today is the anniversary of the launch of ChatGPT: the 30th of November, a year ago, was when it was made public. It has been a rather crazy year for computational linguists. Suddenly, all our research and all the things we do were in the public eye, and we had to figure out how to respond to this, how to talk to the public, and so on, things for which we have not necessarily been trained. So it's been exciting. Also a bit exhausting, but definitely exciting. In the last few years, the research done in NLP and computational linguistics has been driven by what have come to be called foundation models; as you will mostly know, researchers at Stanford University coined this term. It refers to very large machine learning models implemented as artificial neural networks, nowadays using the transformer architecture, a complex and very powerful type of architecture with lots of attention here and there. This is a key methodology that is now used all across the board for developing language technologies, but also for doing research within computational linguistics, and sometimes from a more cognitive perspective as well.
Yes, so foundation model is a generalization: it does not refer only to language-based models, but it all started with large language models. A foundation model uses the same type of mechanisms as a large language model but does not necessarily have to do with language; that's why it's a generalization. But it all started with this type of model, this type of transformer-based deep learning architecture, being applied to language, and those are the large language models.

So here's the overview of what I want to talk about. In the first part of the talk, I will be focusing on foundation models that are large language models. These are foundation models that are trained only on text and, as I said, were the first type of foundation models. I want to explore two questions: whether we can use ideas from human language processing to evaluate the generation capabilities of these models (this will become clearer a bit later), and whether we can use these large language models to quantify human processing effort, which is a notion that is very important in psycholinguistics, the idea that it takes effort to process language. Then, in the last part of the talk, I will look at models that have been trained not only on language but also on other modalities, in particular language and vision, language and images. Sometimes these are called multimodal foundation models, or just multimodal models. They go beyond large language models that are trained only on text. Here the question I will be asking is whether the knowledge of these models, which in principle have more information than large language models, is aligned with humans' linguistic intuitions. At the end I will give you some ideas about work in progress, which has to do with figuring out whether these multimodal models can also help us understand what is going on in the human brain when we are processing input that comes from different modalities, the visual and the linguistic modalities. All right. That's the plan; if you have questions in the meantime, please just ask me.

So first let's start by taking a look at the basics of large language models, the very basic mechanisms behind them. Large language models are trained on tons of text, very large amounts of text that come from what is freely available on the internet, whatever you can grab, basically. This training data consists of text, of sentences, and so on. A possible training example for a language model may be a sentence like this: "Many commuters travel by train." So we have a training sample, a training sentence like that. Then we mask a particular word, for example the last word in the sentence, and we let the model predict what this word may be. The model knows the whole vocabulary of English, if this is an English language model, so it knows what all the words of English are; these would be, for example, all the word types that appear in the training set.
So now we ask the model: among all these words that you know, which one is the one that appears in this sentence? And it assigns a probability to each of the words that it knows, a probability to each of the words in the vocabulary. At the beginning this probability is going to be very off, because the model doesn't know much yet as it starts processing the corpus; at the beginning it will be quite random. But once the model assigns a probability to each of the words in the vocabulary, the actual word in this particular training example is revealed. And therefore we have a target distribution that should have applied to this example: the word "train" was the actual word here, so it should have received a very high probability, and this gives us a training signal for the model. The model's prediction for this training sample was way off, and now we know how we need to update the model parameters: we need to increase the probability that the model assigns to "train" in this context, and decrease the probability of all the other words. That is the basic mechanism. Now, if you do that many times, millions of times and so on, then the model is eventually going to learn to predict the right probabilities of what is coming next given a context, and to do it very well. It will learn things like, for example, the fact that in this context, after the word "by", there are many words in English that are not really plausible continuations; words like "yellow", "speak", "window", "off" and many other English words would not fit very well here. The model will learn this very well, and it will also learn that there are certain classes of words that have something in common and that would be plausible continuations in this context. Of course this is much more complex than this: the models have many layers, they are very deep, these predictions are done at different layers, there are lots of computations going on, but essentially the core mechanism has to do with this kind of prediction. So through this sort of learning, through the complex architecture, the sheer amount of data, the huge number of parameters and so on, these models are really able to learn a lot of knowledge about language and, as a by-product, also some knowledge about the world that is present in the text they are trained on. In addition, they can also be used as models to generate language, as you know very well if you have interacted with ChatGPT or this type of generative AI model. So we can not only use them to make this kind of prediction but actually to generate language, predicting one word at a time, one sentence at a time, and so on.
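To make the training mechanism described above concrete, here is a minimal toy sketch of that training signal in Python. The vocabulary and scores are made up for illustration; real models do this over huge vocabularies, at every position, with gradient-based updates.

```python
# A toy sketch of the core training signal behind language models:
# predict the next word, compare with the actual word, get a loss.
import math

vocab = ["train", "bus", "yellow", "speak", "window", "off"]

# Pretend these are the model's raw scores (logits) for the context
# "Many commuters travel by ..." at some point during training.
logits = [2.1, 1.8, -3.0, -2.5, -2.8, -1.9]

# Softmax turns scores into a probability distribution over the vocabulary.
exp_scores = [math.exp(s) for s in logits]
total = sum(exp_scores)
probs = [e / total for e in exp_scores]

# The actual next word in the training sentence is revealed: "train".
target = vocab.index("train")

# Cross-entropy loss: high when the model assigned a low probability
# to the actual word, low when the model was confident and right.
loss = -math.log(probs[target])
print(f"P(train | context) = {probs[target]:.3f}, loss = {loss:.3f}")
# Gradient descent on this loss increases P(train | context) and
# decreases the probability of all the other words.
```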
Now, a very key problem within the natural language processing community is how we should evaluate these large language models. It is actually very difficult in general to evaluate the generation of language. NLG, natural language generation, the subfield of natural language processing concerned with generation, has always been very concerned with evaluation, because it is really very difficult to assess the goodness of a generated text, how fitting it is, since there are so many dimensions one could consider: is it good with respect to fitting the context, is it actually fluent, is it appropriate in terms of style? There are so many dimensions along which you could evaluate it, so it is definitely very difficult to evaluate the generative capability of large language models. A proposal that I want to make is to evaluate large language models, not as the only method but as an addition to others, through the lens of human production variability. I'll tell you in a moment what I mean by that. What I want to investigate is whether large language models, LLMs, are able to reproduce the variability that we observe in human language use.

So what do I mean by human production variability? When we speak, when we use language, when we talk to each other, in any given context, in any given communicative situation, for instance in the middle of a conversation, we may be able to continue by saying quite a few different things. Typically, particularly if we are in a dialogue, we are not so restricted as to only be able to express one idea; we can continue in many different ways. So take this conversation: "Can you help me, please?" "Sure, if I can." "I want to send this small parcel to Canada." There are quite a few things that could go on in this conversation. For example, the addressee could respond with "So what do you want me to do?", or another possibility, "To whom?", or "It takes 10 to 14 working days to reach Canada", and so on. These are five possibilities that were actually produced by five different speakers in this case. So there is this variability: what can be said in this context can vary to some extent. This is not the case for all communicative situations. Dialogue, conversation, is one of those situations, maybe the most common one in which we use language, where the variability is very large: we have a lot of possibilities at our disposal. But this is not the case in every communicative situation where we use language. For example, imagine that we need to translate a sentence. The communicative situation is such that what has to be said is clear, because you have it in one language and you just need to translate it into another language. So the variability of what I'm calling here intents, the variability of what to say, is very constrained in a translation scenario. But even when the context, in this case the sentence in a given language, constrains what we need to say, there can be a lot of variability in how to say it: the possible realizations, the words that we choose for how we say something. So beyond what we say, how we say it can also vary quite a bit. That's one of the beauties of language, that we can express the same things in so many different ways.
So consider this translation scenario. We have this sentence in English, "Several companies have thus far reacted cautiously when it comes to hiring", and five possible translations into German. You will see, if you know English and German (my German is quite bad), that the phrase "several companies" can be expressed in different ways in German; here, some of the translations are the same and some are different. And the same for the phrase "reacted cautiously": it doesn't have to be expressed in the same way every time. So here the intent, what is being said, is constrained, but in how we say it there can be quite some variability. So there is this kind of variability in human language production.

Now, this variation or variability is also present in LLMs. Often we talk about it in terms of the uncertainty of the model, how uncertain the model is about what it can generate. An LLM considered as a text generator is a probability distribution over productions, that is, over sequences of words, over sentences, given a certain context. If you give a prompt, a context, to your language model, then the model specifies a probability distribution over the possible sequences that can come next. So if we look at it in this light, we can ask the question I mentioned before: is this generation potential of the LLM (sometimes we refer to it as its representation of uncertainty, but there is no need to use those words) similar to the human production variability that we see in human language use? This is what I want to investigate and use as a dimension for evaluating how good a language model is.

Now you may ask: why should we care, is this a good criterion at all? So, do you think this is a meaningful question to ask? In what sense does it matter that an LLM reproduces, in some way or other, this variation, this potential for flexibility or whatever you want to call it, of human language use? What do you think?

[Audience] You could think that if it does generate the same kind of variability, this could be very useful for chatbots that are used in health care, for example. A lot of people are trying to use them for things like mental health care triaging, or even mental health counseling, like an AI therapist, if you will. But depending on the context...

So, if I understand you correctly, even if we match the level of variation in this probability distribution, there are many things that we will not solve, and I completely agree. I think the two points are compatible. For me, like you were saying, there is value in trying to get this kind of match because we want to use these technologies to speak with humans, to interact with humans. So it seems to make sense for them to have very similar capabilities, so that the expectations of the humans somehow match what the model is going to produce. But just by enforcing or having this as a desideratum we are not going to solve everything; we are not going to add intents to the model. So of course there is a lot more work to do to create an actual, natural conversational partner.
Okay, yes, I added this here because, as you see, the kind of questions I'm asking are driven by the assumption that comparing these models to humans makes sense. I think this kind of question makes sense here, but that's not always the case. In some situations in AI, there might not be value in trying to devise AI systems that behave like humans, because maybe another type of behavior is more effective for a given task. Machines may do some tasks better than we do, and in some situations that's what we want: not to reproduce how humans do things, but to do things better. But when it comes to language, particularly if the point is to have machines interacting with humans, then I think this kind of perspective makes a lot of sense.

All right, so we are going to try to figure out this question: we want to compare the model's generation potential to the variability of human language use. Now, we cannot compare this probability distribution directly to the human population, because we don't have access to the whole human population and we don't have access to the whole probability distribution. But what we can do is sample from these two processes, the human process and the LLM process. Given a context that is fixed, which comes from actual human language use, we can use our LLM to sample a set of productions, a set of possible continuations. And we can do the same with a set of humans; in fact, the dialogue continuations I showed you at the beginning were produced by several human participants, so we are sampling from the human population as well. So now imagine that for each context we have two sets of samples, one automatically generated by the LLM and another one generated by humans. What we are going to do is measure the variability in each of these sets independently. We define some pairwise measure, and we measure the variability within all the automatically generated productions and within all the human-generated productions. We can define different types of measures. One possibility is semantic variability: if we have a vector representation, an embedding representation, of each of the sentences, then we can compute cosine similarity between the sentences; that's a measure of semantic variability, of how similar the meanings of these productions are. There are other possibilities too: we could measure lexical variability, the degree of word overlap, for example, which has to do more with how things are expressed. Other measures are possible as well that are more syntactic, for example how structurally similar or different the productions are, and so on. So we measure the variability in the automatically generated set and in the human set, and then we compare the two, measuring the distance between these two distributions of variability scores. Here is an example, just to illustrate what we are doing, of another dialogue context: "It's very dark in here. Will you turn on the light?"
"Okay, but our baby has fallen asleep." "Then turn on the lamp, please." "But where's the switch?" And then you can see five possibilities generated by humans, and ten possibilities generated by this LLM, DialoGPT. I will not read them all, but some of them make more sense than others. Here, for example, it is very visible that the length is quite different: the human ones are longer, the others are shorter. But I'm not going to focus on length, because that is not something that holds for all the LLMs, so it doesn't matter here. But I hope the method is clear: we are going to compare these two things.

All right, so let's see what happens. First, we can use this just to look at human production variability on its own: we take our samples from the human process, do this pairwise measurement, and check how much variability there is in human language use. I'm just going to show it here for semantic variability, so we are comparing embeddings of the human productions. We are looking at four different communicative situations. One is dialogue, like the examples I have been showing. Another is translation; I also showed you some examples of that before. Then there are two other tasks. One is simplification, that is, text simplification: given a certain text, a few sentences, produce a simplified version of that text. This is useful in many situations, for instance if you want to create a text that is more appropriate for certain age groups or for non-native speakers. So the context is the original text and we generate a simplified text. And then there is story generation, where the prompt is the beginning of a story and the task is to provide a continuation of the story. It has to have some plot and so on, but there are no very concrete instructions; that's story generation. What we can see in this plot is that the amount of semantic variability in these communicative situations, in human language use, is quite different, and it goes according to the intuitions I think we would have. For dialogue and also for story generation there is a lot of variability, because these are communicative situations where we have a lot of freedom in how we can continue; what can be said is very open. We sometimes call them open-ended tasks.

[Audience] How did you quantify the variability here?

This is semantic variability: we extracted sentence embeddings for each of the continuations, and then cosine distance.

[Audience] So it's measured pairwise?

Yes, over responses to the same context. The context is always fixed, and then there is a set of productions after this context.

[Audience] And this gives a unimodal distribution in each case?

Yes, actually, and for the other measures we looked at they are also unimodal. So here we are only looking at human language, but it's super interesting in my opinion. I didn't want to bother you with more information, but when you look at the different types of similarity and distance measures, semantic, lexical, and syntactic, seeing how this varies across the different communicative tasks is pretty interesting.
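As an illustration of the pairwise semantic variability measure, here is a small sketch, not the exact code from the study. It assumes the sentence-transformers library, and the embedding model name is just one common choice; the example sentences are placeholders.

```python
# Sketch: semantic variability within a set of productions, measured as
# the mean pairwise cosine distance between sentence embeddings.
from itertools import combinations
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice

def semantic_variability(productions):
    """Mean pairwise cosine distance between sentence embeddings."""
    embeddings = model.encode(productions)
    distances = [
        1 - cosine_similarity([embeddings[i]], [embeddings[j]])[0][0]
        for i, j in combinations(range(len(productions)), 2)
    ]
    return sum(distances) / len(distances)

# Two sets of continuations for the same fixed context:
human_set = ["Sure, the switch is by the door.",
             "But our baby has fallen asleep!",
             "It's fine as it is, I think."]
llm_set = ["Okay.", "I will turn it on.", "Sure thing."]

# Compare the variability of the human samples vs. the LLM samples.
print(semantic_variability(human_set), semantic_variability(llm_set))
```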
So we can see here that for translation and for simplification, where what has to be said is determined by the context (we have the sentence in the original language, or the original text, that determines what needs to be said), things are much more constrained: the variability in how to respond, how to generate, is much smaller, so there is not a lot of semantic distance. All right, so this is what we see in the human processes, in human language production. Now the question is: do our neural text generators, our LLMs, reproduce this kind of human variability? I will not show you a plot like the previous one, because what we are measuring now is the statistical difference between the two sets of variability scores. This plot shows the results for each of the different tasks: dialogue, story generation, simplification, and translation. Sorry, I realize the colors don't match, but I hope you can still decipher it. Basically, when things are aligned along the zero in the middle, the match is very good: there is not a lot of divergence between the two types of distributions. And I think that in general it's quite aligned in the middle. What mostly stands out is that for translation the variability tends to be underestimated, which is interesting because for humans this is also where the variability is lowest. When we use an LLM, it is even less varied: it will converge on very few possibilities. We can also see that for dialogue it is quite well aligned on the zero, but perhaps it overestimates the amount of variability, so the model can even go a little bit off the rails in its responses; perhaps this is also what you saw in the DialoGPT examples I showed you before. But I would say that overall this is not so bad: overall, these LLMs (we are evaluating a set of LLMs here, I didn't even tell you which ones) approximate human production variability relatively well, with some overestimation, but relatively well.

[Audience] I'm curious what would happen if you asked humans to rate what was generated by the LLM for the subjective quality of the response. Because, for example, if I were to subjectively rate the quality of DialoGPT's responses, I feel I wouldn't rate them that high.

Yes, of course, because quality has so many dimensions; there are so many dimensions to what makes a suitable generated text. So this only taps into one particular dimension, and any additional evaluation on top of it would of course be very valuable, to understand whether this variability also correlates with quality at other levels or not. So we should not read more into this than what it is: it really just quantifies this variability; other things might still be pretty bad.

[Audience] So this variability on the human side is across many people, while the machine side is one single model. Then one issue is something like sequential consistency.
[Audience, continuing] If it talks like me in one sentence and like another person in the next, each sentence may be fine on its own, but together that would be totally strange.

Right, yes, that's true, and it's a very good point: we are treating the LLM as equivalent to a human population. You would hope that if it is used in a conversation, the turn-by-turn context of the conversation constrains things in a way that you still get at least the illusion that you're talking to a coherent agent, but we know that this is sometimes not the case. There are also quite some people trying to work on this consistency by creating personas and so on, restricting the behavior of the LLM. But it's a good point: we are treating the LLM as a population.

All right, so this is not to say that LLMs are great at everything they generate; it is just that they seem to match human production variability relatively well. Given this somewhat positive result, now I want to go a bit further into this kind of cognitive investigation and ask: can we also use these models to investigate other, more classic psycholinguistic questions? One fundamental notion in psycholinguistics has to do with quantifying human processing effort when humans are using language. Language is effortful: it involves some cognitive effort both to produce language and to comprehend it. The idea is that when we use language in conversation to communicate with each other, speakers and addressees (in a dialogue we constantly exchange these roles) balance this effort in a kind of collaborative activity. Addressees, those who are in comprehension mode in a dialogue, are assumed to actively predict what is going to be said next, and this prediction means investing some effort. And when we decide what we are going to say and how we are going to say it, when we plan these aspects, we presumably take into account the processing effort on the side of the addressee. So the decisions the speaker makes take into account the addressee's processing effort. This idea of effort is very central to how we use and process language, both on the production side and on the comprehension side.

There has been quite some work on how we can quantify this effort. As you can see here, there is a sort of equivalence between predictability, how predictable something is, and the effort involved: things that are very predictable are supposed to be less effortful, and things that are very surprising, that are not easily predictable, are supposed to incur quite a lot of cognitive effort. So we want ways of quantifying this effort, to better understand how we collaboratively manage it in conversation. How can we capture it? There are some established methods in psycholinguistics, maybe some of you are familiar with them, that come from information theory and have to do with quantifying the information content, also called surprisal, of words in context: surprisal measures how improbable a particular word is given its context.
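For reference, surprisal is standardly defined as the negative log probability of a word given its preceding context, $\mathrm{surprisal}(w_t) = -\log p(w_t \mid w_{<t})$: highly predictable words carry low surprisal, and unexpected words carry high surprisal.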
Then you have to figure out how you compute this probability, but that's the idea, and that's what has been used across the board. There are different issues with surprisal, which I cannot go over in full. For example, it is computed at the word level, so extensions to bigger units, like for instance dialogue acts or other things in dialogue, are a little bit tricky; and it has other problems too. So even though it is a very useful notion, I think there is quite some scope for developing other notions for quantifying effort that could be complementary to surprisal. The proposal that we are presenting is to quantify utterance predictability, the effort it takes to process something, as distance from possible alternatives. Given a context, something has been said; how effortful is it to process that? We can operationalize this as how distant the utterance is from all the things that could plausibly have been said in that context. And this proposed measure, which we call information value, can be formulated by exploiting the generative capability of LLMs. So what we are proposing is to use the generative capability of LLMs to operationalize this type of effort measure.

Let's see how we can do it with this quantity, information value, instead of surprisal. It is a measure of utterance predictability as distance from possible alternatives. We have a context as before, a possible dialogue context: "I ate a pizza the other day." "So what do you feel like eating today?" "How about some burgers?" And then there is a next utterance, the next turn: "I already had a burger yesterday." Basically, what we want to quantify is how effortful it will be for speaker A to integrate this, to predict that this is coming in the dialogue. What we are assuming is that somehow a set of possible alternatives is being taken into account, and the predictability of this next utterance y is defined as how distant it is from the set of possible alternatives that defines what is plausible in this context.

[Audience] What is the particular utterance here?

Yes, y; in this case we know what it is, and we want to quantify how effortful, how predictable, it is in this context.

[Audience] And the context is the sequence of interaction before. What is A?

Yes, and A here is the alternative set: the set of possible alternatives, other than y, of things that could plausibly be said in this context. That's what we call the alternative set. So this is a very abstract view of what I'm telling you, and now the question is: how do we get this alternative set? The utterance is in our data, but how do we get the alternatives? The proposal is to use LLMs to generate this alternative set, like we did before when sampling from the LLM: we give it the context and generate a set of possible continuations. So we have the alternative set generated by the LLM and the actual utterance for which we want to calculate predictability, to calculate effort. And we can apply the distance measures that I told you about before, like semantic distance, lexical distance, or syntactic distance. The idea is that if the distance is small, the utterance is more predictable, and if the distance is large, it will be more surprising.
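Here is a minimal sketch of how this could be operationalized. The specific models (GPT-2 as the generator, a small sentence encoder for the distances) are illustrative assumptions, not the exact setup of the paper.

```python
# Sketch of the information-value idea: sample an alternative set from an
# LLM, then score the actual next utterance by its mean semantic distance
# to those alternatives.
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

generator = pipeline("text-generation", model="gpt2")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

context = "A: What do you feel like eating today? B: How about some burgers? A:"
utterance = "I already had a burger yesterday."

# Sample the alternative set: plausible continuations in this context.
samples = generator(context, max_new_tokens=20, do_sample=True,
                    num_return_sequences=10, return_full_text=False)
alternatives = [s["generated_text"].strip() for s in samples]

# Information value: distance of the actual utterance from the alternatives.
u_emb = embedder.encode([utterance])
a_embs = embedder.encode(alternatives)
distances = 1 - cosine_similarity(u_emb, a_embs)[0]
info_value = distances.mean()  # low = predictable, high = surprising
print(f"Information value (semantic): {info_value:.3f}")
```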
Okay, that's what the model states, what the theory states. So how do we check whether this works, whether this method is cognitively plausible? Are we actually computing human processing effort with this kind of method, where we generate the alternative set with LLMs? What we can do to test this is to see whether, when we calculate predictability in this manner, it correlates with psychometric data, for example data that comes from humans reading text. If you have humans reading text and you track their eye movements using an eye-tracker, this gives you a notion of effort, because when we read, we don't look at all words for the same amount of time. If words are very predictable, we actually tend to skip them; we don't even look at them. When words, or utterances, are more surprising, we spend more time fixating on them. It is well known that reading times give you a lot of information about what is effortful in processing. So we can check whether our estimates, calculated with this alternative set, actually correlate with reading times. There is also another type of data, acceptability judgments: there are datasets where humans were asked to rate how acceptable a given continuation is. It is graded, not just a binary acceptable/not acceptable; it is rated on a scale. So we can also check whether our estimates correlate with this type of acceptability judgment. This will tell us whether the method works reasonably well.

So does it work? Here we have the correlations, and we were quite pleased with the results, because we did obtain significant correlations. The first two are datasets of dialogue, of conversations, with acceptability data, and we see correlations in the expected directions: less distance, more acceptable, as I said before. For reading times, the correlations are more moderate, somewhat low, but still significant, and we can see that this is quite competitive with surprisal, which is the standard measure. I will not go into details, but it is not only competitive with surprisal; we also show that it can sometimes be complementary to what surprisal is capturing. So it looks from these results that this proposal for quantifying effort, where we exploit the generative power of LLMs, does have some value for capturing human processing.

Okay, so an interim summary, and then I hope I can manage the second part, we'll see. We have looked at LLMs, foundation models trained only on text, and we asked whether it makes sense to evaluate them through the lens of human language processing. We saw that they can reproduce the variability in human production reasonably well. And we have seen that it is possible to define a method to quantify effort, and that this method, which exploits generations by LLMs, significantly predicts psychometric measures of human effort.
Now, in the last 10 or 15 minutes, if possible, I would like to tell you a little bit about some work with foundation models that don't look only at text, but that also have access to visual information, that learn from visual information as well. Do you still have energy for this? Yes? All right, then I'll go on.

It is quite accepted in cognitive science that human conceptual knowledge, linguistic knowledge but conceptual, semantic knowledge in general, is grounded in our sensory-motor experience. We learn a lot from language itself, but our knowledge also includes information from our experience, from our perception, and so on. To grasp the concept of banana, it is very difficult to get it just from what is in language, even though we get a lot of information that way; it is clear that if we also have some visual information and some experiential information, we will have a richer notion of what a banana is. So the trend is to move from LLMs trained only on language to foundation models that also use other types of information, so that we have richer models. You may have seen that GPT-4, for example, now also handles images; it is a general trend to move in this direction. I won't give you a lot of detail, but these models are also pre-trained with lots of data; now this data is not only many, many sentences, but also includes images, and the training objective usually includes some sort of language modeling objective, but also some way of aligning images and language. There are many different multimodal models by now; these are just a few of the current ones, and they use different training objectives and so on. But typically they combine some language modeling objective, predicting missing words as I was saying at the very beginning, with some way of finding correspondences between image regions and words.

So these models in principle have access to more information, and it looks like they may be more in line with how we learn language, because we learn language while experiencing things in the world. Does that mean that they are better than language-only models, better than LLMs? Of course it is very difficult to do a comparison, because there are so many differences between them, but still I think we can try. What we know about these models is that they have led to large increases in performance for many types of applications and tasks: applications like automatic image captioning, answering questions about images, searching for an image given a sentence, and even image generation, of which I am sure you have seen examples. So there has definitely been a boom, an increase in performance, using these models. But we don't have a very good understanding of why there has been such an increase. Is it really because they contain more information, because the information is richer? What is it that leads to this performance? And maybe we should not be concerned only with how they improve applications; maybe we should also try to understand a bit better, at a more abstract level, the knowledge that they are learning.
So the question that we have asked in a couple of studies is whether these multimodal models learn representations of language that are aligned with human knowledge, irrespective of whether they give us gains in specific applications, in specific technologies. We call this intrinsic evaluation, because we are not evaluating with respect to whether automatic image captioning gets better, but intrinsically, whether the representations have properties that human representations have. We can look at this at two different levels: lexical semantics, the meaning of words or concepts, and compositional semantics, what linguists call compositional semantics, which has to do with how words are put together in sentences; this has to do with compositionality, with sentence representations. So, at these two levels, how aligned are the models with human knowledge?

Let's look first at words. One way to get at how we represent words is to look at semantic similarity judgments between words. We may ask a bunch of humans to tell us how similar different pairs of words are. For instance, we would expect that "man" and "person" are rated as relatively similar, and "dog" and "airplane" as quite obviously dissimilar. So humans give us this kind of rating on the similarity of words. Then we can do the same with the representations that our multimodal models are learning, or that our language-only LLMs are learning. Each word in the model is represented by an embedding, a vector, so we can calculate cosine similarity, and that gives us a measure of how similar these words are in the semantic space the model is learning. Now we have a semantic space from the humans and a semantic space from the models, and we can compute the correlation between the two. This is a known method that has been used extensively in NLP, but only very recently have comparisons between text-only LLMs and multimodal models been tried. Okay, so that's the method; a sketch of it follows below.
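Schematically, the evaluation looks something like this sketch. The word pairs, ratings, and random embeddings are toy stand-ins for real human datasets and real model representations.

```python
# Sketch: correlate human word-similarity ratings with cosine similarities
# between a model's word embeddings (here, random stand-in vectors).
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Hypothetical human similarity ratings (e.g., on a 0-10 scale) per pair.
pairs = [("man", "person"), ("dog", "airplane"), ("donut", "muffin")]
human_ratings = [8.5, 1.2, 6.9]

# Stand-in for embeddings extracted from a (multimodal or text-only) model.
embeddings = {w: rng.normal(size=300) for pair in pairs for w in pair}

model_sims = [
    cosine_similarity([embeddings[w1]], [embeddings[w2]])[0][0]
    for w1, w2 in pairs
]

# Alignment with human intuitions: rank correlation across word pairs.
rho, _ = spearmanr(human_ratings, model_sims)
print(f"Spearman correlation with human judgments: {rho:.2f}")
```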
Then we ran an experiment with a bunch of human semantic similarity datasets and a bunch of multimodal models, and we also compared these multimodal models to text-only LLMs. Here are the results of the correlation for one of the datasets of human similarity judgments. What we can see looks pretty good: all the models that are multimodally trained, trained with images and language, give higher correlations, pretty high correlations of around 0.7, than an LLM that is language-only. So it looks like it is in accordance with our expectations: they have more information, and it looks pretty good. But it turns out that when we look at all the datasets of human similarity judgments, we don't always see this trend. If we look at this other one, for example, it is the other way around: using only an LLM gives us better correlations (overall not very high, not as high as for the other dataset) than using the multimodal models. So it doesn't work across the board. What is happening, what is the difference between these two datasets? There are a few differences, but one that we found is that they differ with respect to how concrete the words are that the humans are asked to rate for similarity. By concrete I mean things like "donut" and "muffin", which are very concrete because you can see them, you can grasp them, and so on, while there are other words in English that are very abstract, like "freedom" and "dreams". Some of the datasets have a lot of concrete words and others have fewer concrete words and more abstract words, and it turns out there is a clear trend: on the datasets where the words are more concrete, the multimodal models do have an advantage and do better than the LLMs, and on the others the LLMs do better, which is interesting. So there is a trend, but not an advantage across the board. When it comes to lexical semantics, there is an advantage for concrete words: for concrete words, these multimodal models can approximate human similarity judgments quite well.

Now let's look briefly at compositional semantics: what happens if we actually put words together, rather than looking at words as concepts in isolation? How can we evaluate this? For inspiration, we looked at how the language abilities of young children are evaluated. One way this is done is by giving children a coloring book like this one, to figure out whether they understand constructions like the active/passive voice in English. The child may be told "the blue monkey scratches the green monkey" and asked to color things such that this is true, and then "the blue monkey is being scratched by the green monkey" and asked to color things again. This is a way of showing whether the child can match the meaning of a given construction to a situation that depicts that meaning. There are other possible constructions, like coordination, relative clauses, and so on. So we wondered: can we do a similar test to check whether multimodal foundation models actually understand the meaning of sentences, where we put words together and they represent a certain meaning? We created a dataset for testing this kind of question, with many items for three types of constructions: active/passive, coordination, and relative clauses. For a given image there are four sentences associated with it, two of them false and two of them true, but the tricky bit is that they all contain the same words. So if the model is just good at representing the words as concepts in the abstract, it is not going to be good at this task; it needs to know how the construction creates a meaning by putting the words together. Then we designed an experimental setup like this: we give the model the image, we give it one of these sentences, sometimes true and sometimes false, and we ask the model to say whether it is true or false. We have a bunch of models; some of them do this as classification, and some are more generative models, in which case we have to prompt them: we say, given this image, is the sentence such-and-such true or false, and the model answers true or false. So what do you think is going to happen? This is a binary task, true or false, so chance level is 50%. Any ideas?
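For the generative models, the prompting setup might look roughly like this sketch, using BLIP-2 through Hugging Face transformers as an example; the image file, prompt wording, and checkpoint are assumptions for illustration, not the exact setup of the study.

```python
# Sketch: true/false image-sentence evaluation via prompting a
# generative multimodal model (BLIP-2).
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("monkeys.png")  # hypothetical test image
sentence = "The blue monkey scratches the green monkey."
prompt = f"Question: Is the sentence '{sentence}' true or false of this image? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5)
answer = processor.decode(output_ids[0], skip_special_tokens=True)

# The generated answer is scored against the gold true/false label;
# models that produce neither "true" nor "false" cannot be evaluated.
print(answer)
```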
[Audience] How do you use the generative models for this matching task?

With a prompt. These are multimodal generative models: you can give an image as input, and then you write a prompt, because the output of these models is always generated language. So we define a prompt that says, given this image and this sentence (and we just copy in the sentence), is it true or false of the image? We tested several models, and some models were not able to consistently generate yes/no or true/false; they would generate some weird stuff, and then we cannot evaluate them. But the more powerful models always give you a yes or a no, and then you can evaluate.

[Audience] Could you go back to the previous slide? You're talking about discriminative models there. From a machine learning point of view, a discriminative model is in most cases better than a generative one.

You might be surprised. But let me show you the results, because it's already five o'clock. The guess was around chance, 50%, and these are the results that we got. Humans do the task very well. It is not 100%, because there might be a bit of noise in the dataset; for example, sometimes something is not fully visible in the image and there may be some uncertainty, but it is almost 100%. The other results, for active/passive, coordination, and relative clauses, are all around chance level. So that was pretty disappointing. You can see that the highest, even though it is still very low, is BLIP-2, which is a generative model. It is also a newer generation; the others came out a bit earlier. It is not a fully open model, though, so it is difficult to dig deeper into why that is. But that is what we found. Does it surprise you?

[Audience] Can I ask, are the numbers of parameters for the BERT-style models and BLIP-2 roughly similar, or is BLIP-2 much bigger?

It is bigger; the generative ones are bigger. And the others are in fact quite different in their architecture and their training objective. When I chatted about this with some people the other day, I invited them, and I invite you, to take this dataset and test the newer models, like GPT-4, because I would be curious to see what that looks like. This is pretty disappointing: these very basic sentences, which preschool children can process, cannot be processed here. So basically, at around chance level, these models cannot represent this kind of situational knowledge. I am not sure I have time to go through this slide, so very briefly: the results on compositional semantics are bad, but the results on conceptual knowledge are more decent. So we think there may still be possibilities for using these multimodal models to investigate conceptual knowledge as it is represented in the brain. Predicting human brain activity with LLMs is something that has been done in computational neuroscience, but using multimodal models for this is still very new.
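The standard encoding-model approach referred to here can be sketched as follows; the data are random stand-ins, and ridge regression with cross-validation is a usual choice in this literature, not necessarily what any specific study used.

```python
# Sketch: predict brain activity (e.g., fMRI responses) from stimulus
# embeddings with a cross-validated ridge-regression encoding model.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 512))  # embeddings for 200 stimuli (sentences/images)
y = rng.normal(size=200)         # activity of one voxel/region per stimulus

encoder = Ridge(alpha=1.0)
scores = cross_val_score(encoder, X, y, cv=5, scoring="r2")
print(f"Cross-validated R^2: {scores.mean():.3f}")

# Comparing this score for multimodal vs. language-only embeddings asks
# whether multimodality helps predict activity in a given brain area.
```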
There are some results, like from this paper here, showing that if you use multimodal embeddings to predict high-level visual cortex activations when people are looking at images, this gives you better results than if you use vision-only foundation models; so having some linguistic information helps you predict activation in visual processing areas. What we would be interested in looking at is the other way around: if you have multimodal models and you want to predict activations when people are reading sentences, for example, does the multimodality give you an advantage for these more linguistic processing areas, compared to using only language embeddings? These are the questions; this is work in progress, so I will not say anything further.

All right, so I am finishing. I already gave the summary of the part on language-only models. As for the multimodal models, these were the questions and these were the results: do they learn representations that are aligned with human linguistic intuitions? Well, it depends. For conceptual knowledge, pretty good; for compositional, situational knowledge, it seems to be pretty bad. And as to whether we can use this information to help us figure out multimodality in the brain, we'll see; it looks, hopefully, promising. I just want to end by thanking you all, and also all my team members who took part in these studies. If you are interested in the details, these are the publications; most of them are actually going to be presented next week at EMNLP. Thanks.