So this is really presenting a particular paper that we submitted to CogSci last year. It sits at the intersection of artificial intelligence and cognitive science: it's a study into active word learning through self-supervision. So we are modeling processes that also happen in human language learning, but we're modeling them using AI. I worked together with Afra Alishahi, my supervisor at CSAI, and Alireza Mahmoudi Kamelabad, who was a master's student at the time doing a research project with us, and who is currently working on his PhD in robot-assisted language learning.

Okay, so I wanted to open with this tweet, with a quote from Yann LeCun, who says that long-term progress in AI will come from programs that just watch videos all day and learn like a baby. Now, if you're a parent, I don't know what your opinions on this are, but this drew a lot of joking responses on Twitter, such as Jacob Eisenstein saying that, as anyone who has spent even a minute or two with babies knows, they learn exclusively by putting random objects in their mouth. Now, of course, this is a joke. But what I find quite interesting about this response is that in the second case the baby is not just watching videos: the baby is actively doing something, actively exploring their environment. So even though it's a joke, there is a grain of truth in there, and that is the kind of idea that we are modeling in our research.

So, in cognitive science, one of the things that is studied a lot is how we learn which words map to which objects in the environment. It is very important, of course, to be able to talk about things in the world, and for that you need to know the correct label for objects. One of the ways in which this is often modeled is through cross-situational word learning. There are many different implementations, but the basic idea is always: if you look at a lot of co-occurrences of objects and words in different situations, eventually you will be able to figure out which ones map to each other.

Now, as I said, there are many different implementations and algorithms proposed for this, but they all have in common that typically the learner just happens upon language and visual input: the learner just happens to see objects and hear words. When you think about parents and their children, though, very often parents will talk about objects that the child is already attending to; we know from language acquisition studies that this is the case, parents really try to follow their child's attention. So we thought: potentially, children have the opportunity to shape their own language input. What if they do so based on their current language knowledge? What if they are particularly curious about objects that they don't really know the correct label for, for example; how does that shape learning? That's the basic idea. So we try to see how selecting input according to some measure of your current language knowledge would shape the learning trajectory.

So what is the task that we're modeling? Basically a two-part task. One part is figuring out which object is the correct referent for a label, and the other is which is the correct label for an object. So we have a comprehension task and a production task.
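To make the cross-situational idea mentioned above a bit more concrete, here is a minimal count-based sketch. It is not the model from the paper, just an illustration of how co-occurrence counts across situations can eventually reveal word-object mappings; the function name and the toy data are made up for the example.

```python
from collections import Counter, defaultdict

def cross_situational_counts(situations):
    """Count how often each (word, object) pair co-occurs across situations.

    `situations` is a list of (words, objects) pairs, both lists of strings.
    After many situations, the highest-count object for a word becomes a
    reasonable guess at its referent.
    """
    counts = defaultdict(Counter)
    for words, objects in situations:
        for w in words:
            for o in objects:
                counts[w][o] += 1
    return counts

# Toy example: "duck" co-occurs with the duck object in both situations,
# but with the sun, tent and dog only once each.
situations = [
    (["look", "duck"], ["duck", "sun", "tent"]),
    (["the", "duck"], ["duck", "dog"]),
]
counts = cross_situational_counts(situations)
print(counts["duck"].most_common(1))  # [('duck', 2)]
```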
What is special about our model is that we have these two modules, but they can actually feed into each other. So the comprehension module sees a bunch of objects, hears a word, and outputs the object that it thinks is the most likely referent. Then we can feed this object to the production module, which outputs a label. This gives the model a sort of introspective quality that we can use both to train it in a self-supervised way and to select input that is relevant given its current state of knowledge.

So how does this self-supervised learning go? This is a toy example; in reality we use realistic images. Basically we have scenes with a bunch of objects, so let's say there are three objects in the scene, a duck, a sun and a tent, and we hear the word "duck". The comprehension module tries to find the correct object that goes with this label and says: okay, this second object, this sun, that is probably the referent. This is wrong, but the model doesn't know that. Then we feed this object to the production module, which outputs a label. So we put in a word, and out comes a label.

Now, how can we use this for learning? Well, we just compare the input at the start to the output at the end. In this case the input to the comprehension module does not match the output of the production module. If it doesn't match, we know that we made some mistake and we need to move away from this mapping. We don't really know where the mistake is: either the comprehension module made a mistake or the production module did, but in any case, as a whole the model didn't perform correctly.

We're going to go a little bit more technical now, for those of you who are familiar with how neural networks work. Basically, we have word embeddings that are random at the start, and we concatenate them with a VGG representation for every object in the scene, so we do this once for every object. We output one value per object, so basically we have kind of an MLP. Since we do this once for every object in the scene, we have as many output nodes as there are objects in the scene, and we can interpret these values as the likelihood that this is a correct pairing of an object and a word. Then we can input an object from the scene, again represented as one of these VGG vectors, which are sort of high-level visual features, and the production module outputs a distribution over the vocabulary. In practice, what we use to train the model is the cross-entropy between this output distribution and the one-hot encoding of the word we put in at the very start. We backpropagate through this whole thing, and the reason we can actually do that is that during training we don't input just one object, but a weighted sum of all of the objects in the scene, so we treat the comprehension values as attention. That's the technical part.

So now, how do we use this two-part architecture to also select input? Because that's the other part of this study. Now we sort of flip the modules.
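For those who want to see the mechanics, here is a rough sketch of the comprehension/production setup and the self-supervised training step described above, written in PyTorch. The plain MLPs, the hidden sizes, the sigmoid/softmax choices and the way the attention-weighted object sum enters the production module are assumptions made for illustration, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Comprehension(nn.Module):
    """Scores each object in the scene as the referent of the heard word."""
    def __init__(self, vocab_size, embed_dim, feat_dim, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # random at the start
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, word_idx, object_feats):
        # object_feats: (n_objects, feat_dim) visual features, e.g. VGG vectors
        word = self.embed(word_idx).expand(object_feats.size(0), -1)
        scores = self.mlp(torch.cat([word, object_feats], dim=-1)).squeeze(-1)
        return scores  # one value per object in the scene

class Production(nn.Module):
    """Maps an object representation to a distribution over the vocabulary."""
    def __init__(self, vocab_size, feat_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, vocab_size),
        )

    def forward(self, object_feat):
        return self.mlp(object_feat)  # logits over the vocabulary

def self_supervised_step(comp, prod, optimizer, word_idx, object_feats):
    """One training step: comprehension scores act as attention over the
    objects, the attended (weighted-sum) object goes through production, and
    the loss is the cross-entropy with the word we put in at the start."""
    attention = F.softmax(comp(word_idx, object_feats), dim=0)
    attended = attention @ object_feats          # weighted sum of the objects
    logits = prod(attended)
    loss = F.cross_entropy(logits.unsqueeze(0), word_idx.unsqueeze(0))
    optimizer.zero_grad()
    loss.backward()                              # backprop through both modules
    optimizer.step()
    return loss.item()
```

In a training loop you would call `self_supervised_step` once per scene-word pair; using the softmaxed comprehension scores as attention is what keeps the whole round trip differentiable.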
So what we do, because we want to know for each object in the scene: would it be helpful to receive the label at this point, would it be helpful to receive language input for this object? For every object in the scene, we input the object to the production module, which outputs a label. Then we feed this label, together with all of the objects in the scene, through the comprehension module, which outputs an object again, and again we can see whether this matches or not.

Now, how do we actually use that to see whether a label would be helpful? We can look at the mismatch: basically, is this the same object or not? That is what we call subjective novelty: if this mapping is very wrong, then subjective novelty will be very high; it's basically the average absolute difference between the output nodes. Then we have a second measure, which is called plasticity, and this is highest for those items about which the learner is unsure, so it hasn't made up its mind very clearly. For those of you who are familiar with how neural networks are trained: this is based on the gradient of the output activation function, which means that if we do a single update for an item with this kind of pattern, that would bring a very big change in the output representation. So we have a measure that says how wrong we currently are, and a measure that says how much we have made up our mind. And then we have curiosity, which is basically the product of the two, so it considers both of these factors. Finally, we have another condition, which is random: we just pick one of the objects in the scene at random, and that's the one we get a label for.

Let's see how these results compare. First of all, we see that models trained in the curious condition are performing the best. They do better than models trained in the random condition, but this is not actually the case for plasticity, and particularly subjective novelty is really lagging behind. Another thing we can see, and this is averaged over 20 runs, is that the standard deviation for models in the curious condition is the lowest, which means that most of the runs converge to the same level, whereas for the random condition there is a bit more variation, and for plasticity as well. Those were the comprehension results; the production results I don't want to go into deeply, we don't really have the time I think, but they show the same pattern. The numbers are lower; the reason is that we now have 4,000 categories, in other words 4,000 words to choose from, rather than the three objects that a scene contains on average, so it's a much more difficult task. That's one of the reasons, at least.

I wanted to show you these plots, which are the results after each epoch of training for all of the models that we trained; we trained 20 models in every condition. Let's first look at the curious models, which are the orange ones. As you can see, they have the highest accuracy both on training and on test data, and all of the lines are close together, so they converge to this level quite robustly.
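As a rough sketch of this selection step, assuming the model sketched earlier and a sigmoid output activation: the paper's exact definitions of subjective novelty and plasticity may differ, so the formulas below (mean absolute difference from a one-hot target, and the sigmoid derivative p*(1-p)) should be read as illustrative approximations rather than the actual implementation.

```python
import torch
import torch.nn.functional as F

def curiosity_scores(comp, prod, object_feats):
    """For each object in the scene, estimate how useful a label would be.

    subjective novelty: how wrong the production -> comprehension round trip
        currently is for this object (mean absolute difference between the
        comprehension outputs and a one-hot vector pointing at the object).
    plasticity: how undecided the model is, approximated here by the sigmoid
        derivative p * (1 - p), which peaks when the model hasn't made up
        its mind.
    curiosity: the product of the two.
    """
    n_objects = object_feats.size(0)
    novelty, plasticity = [], []
    with torch.no_grad():
        for i in range(n_objects):
            # Production: guess a label for object i.
            label = prod(object_feats[i]).argmax()
            # Comprehension: which object does that label point back to?
            probs = torch.sigmoid(comp(label, object_feats))
            target = F.one_hot(torch.tensor(i), n_objects).float()
            novelty.append((probs - target).abs().mean())
            plasticity.append((probs * (1 - probs)).mean())
    return torch.stack(novelty) * torch.stack(plasticity)

# Curious condition: request a label for curiosity_scores(...).argmax().
# Random baseline: ignore the scores and pick an object uniformly at random.
```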
For the blue ones, which are the random condition, some of the lines converge to similar levels, but this is not robustly the case: some of these runs don't really end up at the same level. For plasticity the picture is a little bit different, but kind of similar. Another thing to see is that even after the very first epoch of training, the curious models already have a leg up. Now, you may wonder about subjective novelty, which is the odd one out: these bizarre green lines here, performing exactly at baseline. What subjective novelty seems to be doing is that it's very good for remembering your training data, but not good at all for learning anything that generalizes.

Okay, that was the results. So basically, curious exploration does help: learning is faster, more robust and more accurate. But subjective novelty and plasticity alone do not help. Thank you.