Artificial intelligence is the field that has struggled to build computational models that can perform feats of intelligence, but it hasn't really been concerned as much with explaining brain or behavioral data. So it's really at the intersection of these fields that I think we can make progress. There's a long history of trying to meet more and more of these criteria, and previous attempts have not completely succeeded, but I think now is a very exciting time to be a neuroscientist, because we are now in a position to put these pieces of the puzzle together.

There's a particular modeling framework, neural networks, that I think will be key to all this. It's a modeling framework that's central to this interaction and has a long history in each of these fields. In the 80s it caused a revolution in cognitive science, a paradigm shift toward ideas of parallel distributed processing. It has a long history in computational neuroscience at many different levels, some of which we heard about earlier today and in other talks at this conference. A lot of the work in computational neuroscience is at a very detailed level, with spiking models or very detailed, biologically faithful models. But the kinds of neural networks I'm going to be talking about are at a very abstract level; you could think of them as modeling the rate-coding level. These networks also have a history in artificial intelligence, and they're currently, as many of you I'm sure have heard, revolutionizing several domains of artificial intelligence. So they're very exciting from the perspective of what they can do in terms of task performance. I think this is a perfect storm for the next 20 years or so for us to really engage higher-level complex brain processing with explicit models that perform the tasks.

I work in vision, so what does this mean for visual object recognition? In my field, the goal is to build a biologically plausible network model that can recognize novel object images. It has to be able to perform the task of object recognition, so it has to be a computer vision model, and it has to predict neuronal representations and human behavioral responses. So the smaller story I'm going to tell you is about how we began to use these models in my lab and to test them with brain and behavioral data.

We present visual object images to our subjects. Here's a set of 92 isolated object images from lots of different categories. We present them to our subjects while measuring their brain activity; subjects can be humans or monkeys, measured with fMRI or cell recordings. Then we analyze the representations in those brains and the representations in computational models in a single analytical framework that I call representational similarity analysis, or RSA. Here's how this works. We present each image to the brains of our subjects, and in a given region of interest, for each image, we get an activity pattern, which we consider a representation of that particular object. For another object image we get another pattern in the same region of interest, and that's the representation of that object; the difference between these is the representational difference in that brain region. We can play the same game for the models: the models have internal representations as well.
So again, for each image we get a unique fingerprint: the activity pattern elicited in the internal representation of the model by that image. Now we want to relate these representations in order to address the question of whether the model is a good model of a particular brain representation, for example a visual area, say visual area V4. This is difficult at the level of the original responses, because we don't know the correspondence between the units of the model and the measured response channels, for example the voxels in fMRI or the single cells in neural recordings. One approach is to fit a linear forward model that predicts each measured response as a linear combination of all the units of the model. However, as we engage very complex models, they can have hundreds of thousands of units, and then we're fitting a lot of parameters for each of our measurement channels.

This is the reason we take a different approach. We look one level up, at the level of the dissimilarities between the activity patterns. We make a matrix, indexed vertically and horizontally by the stimuli, where for each pair of stimuli we can look up the dissimilarity of those two stimuli in the representation. We call that the representational dissimilarity, and the matrix is the representational dissimilarity matrix, or RDM. This is our signature of the representation, which tells us what pairs of images are similar according to this representation and what pairs are distinct: what distinctions this representation cares about, if you will. The RDM signature is easy to compare between different representations because it's indexed by the stimuli, so we've abstracted from the confusing multiplicity of single responses, and we can very straightforwardly compare different representations: between different brains, between different species, but, importantly for today's talk, between models and brains. We have a toolbox for this, the RSA toolbox, which provides methods for statistical inference on brain-computational models.

So how does this work? Here's a little simulated representational dissimilarity matrix, indexed by the stimuli along both axes. The cold colors are things that are similar, so you see two clusters in this simulated example. Then we have a number of model RDMs, and now it's just a matter of comparing these model RDMs to the one we measured in our brain region of interest. We use a rank correlation, to minimize the assumptions made about that relationship, to assess the fit of each model, and then we use permutation and bootstrapping techniques to do frequentist inference on these comparisons.
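To make those two steps concrete, here is a minimal Python sketch: computing an RDM from a stimulus-by-channel pattern matrix, and rank-correlating it with a model RDM. The correlation distance and Spearman correlation are common default choices, and the shapes and random data are purely illustrative; this is not the RSA toolbox API.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def compute_rdm(patterns):
    """patterns: (n_stimuli, n_channels) activity patterns, one row per image.
    Returns the condensed RDM: the correlation distance between the
    activity patterns of every pair of stimuli."""
    return pdist(patterns, metric="correlation")

def rdm_correlation(brain_rdm, model_rdm):
    """Rank-correlate two condensed RDMs, making minimal assumptions
    about the form of their relationship."""
    return spearmanr(brain_rdm, model_rdm).correlation

# Illustrative example: 92 stimuli, 100 measured channels, 4096 model units.
brain_rdm = compute_rdm(np.random.randn(92, 100))
model_rdm = compute_rdm(np.random.randn(92, 4096))
print(rdm_correlation(brain_rdm, model_rdm))
```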
You see that most of these, except for the last one, are significantly related to the simulated brain representation. In addition, we compute a noise ceiling, which tells us, given the noise in the data and the intersubject variability, what level of performance we would expect for the true model. The noise ceiling has a lower bound and an upper bound, which are lower- and upper-bound estimates on the performance of the true model. Then we do pairwise comparisons: we statistically compare each pair of models, usually by bootstrapping the stimulus set. We consider the stimuli a random sample from a population of stimuli, so we can do inference comparing the models, and we draw a horizontal line whenever two models are significantly different. Here there are lots of lines because there are lots of significant comparisons (some are missing, actually, if you look very carefully), but in general we have a lot of power, and we need to correct for multiple testing, which here we've done by controlling the false discovery rate.

With these tools we can address whether a model explains significant representational variance, whether it explains significantly more representational variance than another model, and whether it explains all the non-noise representational variance: when it hits the noise ceiling, it predicts the data RDM as well as we would expect of the true model, given the noise and the intersubject variability in the data. We use frequentist nonparametric inference, controlling for multiple testing, with random-effects tests across subjects and across stimuli. Across subjects, because usually we're not interested in our particular group but want our data to support generalization to the population; and across stimuli, because usually we're not particularly interested in the particular stimuli we've chosen but in making more general statements about the computational models: we want them to perform for any random sample of stimuli, and that's really the target of the inference. The toolbox also supports searchlight RSA, where we perform these region-of-interest analyses all over the brain and do inference at the level of the resulting maps, again correcting for multiple tests, in that case across brain locations. And we use a range of distance metrics for measuring representational dissimilarities, including the linear-discriminant t value and the crossnobis distance, which bridge the gap between RSA and linear decoding analyses. You can think of RSA as a generalization of linear decoding in which you do the decoding for all pairs of stimuli, so you get a very rich and complete characterization of the representational geometry, if you will.

In 2008 we applied this technique to the image set I showed you at the beginning, and we got representational dissimilarity matrices for human inferior temporal cortex and monkey inferior temporal cortex, where human IT was measured with fMRI voxels and monkey IT with cell recordings. You see clear clusters of animates and inanimates; there's a human-face cluster here and an animal-face cluster here. The animal faces are much more diverse in color and shape, which might explain why that cluster is less tight. And there are also very high similarities between patterns when one pattern is elicited by a human face and the other by an animal face.
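As an illustration of the stimulus bootstrap, here is a simplified sketch of how two model RDMs could be compared against a brain RDM by resampling the stimulus set. The real inference machinery in the toolbox additionally handles resampling across subjects, the noise ceiling, and FDR correction over all pairwise tests; treat this as the core logic only, with illustrative names throughout.

```python
import numpy as np
from scipy.stats import spearmanr

def bootstrap_compare(brain_rdm, model_a, model_b, n_boot=1000, seed=1):
    """Compare two model RDMs by bootstrapping the stimulus set.
    All RDMs are square (n_stimuli x n_stimuli). Returns a two-sided
    p-value for the difference in Spearman RDM correlation."""
    rng = np.random.default_rng(seed)
    n = brain_rdm.shape[0]
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.choice(n, size=n, replace=True)    # resample stimuli
        ii, jj = np.meshgrid(idx, idx, indexing="ij")
        keep = ii != jj    # drop self-pairs created by duplicated stimuli
        b = brain_rdm[ii[keep], jj[keep]]
        r_a = spearmanr(b, model_a[ii[keep], jj[keep]]).correlation
        r_b = spearmanr(b, model_b[ii[keep], jj[keep]]).correlation
        diffs[i] = r_a - r_b
    # two-sided p-value: how often the bootstrapped difference crosses zero
    return 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
```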
So this categorical structure was strikingly similar between human and monkey, and within these categories, too, there's a lot of correlation between the two matrices, as we showed in the initial paper. However, the question this really posed, for the purposes of the present talk, is: can we explain the IT representation with a computational model?

My graduate student Seyed-Mahdi Khaligh-Razavi, who is here, in front of his new home (he's left the lab and is now a postdoc at MIT), devoted his PhD thesis to comparing the IT representation in human and monkey to many different computational models. How similar are computer vision representations to IT? That was the first question he asked, so he took lots of models from computer vision, different hand-engineered computer vision features, and compared them to IT. Here I'm going to plot for you the accuracy of human-IT dissimilarity-matrix prediction; as before, that's the RDM correlation between the brain region and each of the models. Let's look at one model, the GIST model, a particular set of computer vision features based on summaries of Gabor filter responses. You see that it explains significant variance in the RDM of human IT; it's highly significant. One could tell a story about this, to some extent. It clearly shows that there is information about the stimuli in this brain region; you could use this information to decode, which is a very popular method, and there is a particular encoding model here that explains significant variance. However, before we interpret this, we should look at other models. Here are 26 other models, all computer vision features, and almost all of them perform highly significantly; some explain a little less variance, others a little more, but overall they all explain some variance. The important thing to look at is the noise ceiling, which is much higher. This shows that there is a lot of variance left unexplained by all of these models, so our conclusion from this exercise was that all 27 models fail to explain IT.

Since many of the models explained significant variance, we were interested in testing whether combining the models might help us explain the IT representation. Imagine each model has some of the features present in IT, but none has all of them; if we combined the features, maybe we could do better. So we remixed and reweighted the representation to best explain IT, always cross-validating across images. It's totally uninteresting to us when we overfit to a set of images, because that means the model doesn't really work as a computer vision model: it wouldn't do the task on a new set of images. So we always cross-validate. We used a separate image set to create linearly remixed features: we fitted three support-vector-machine discriminants to emphasize the major categorical divisions known from the literature to be prominent in IT, and then we reweighted the representation using non-negative least squares, assigning one weight to each model's feature set and one to each of the three SVM discriminants. That's a low-parameter way of finding the right mix of features to explain the matrix, again always cross-validated across images. When we did that, this is the matrix we got.
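A minimal sketch of the reweighting step, under the assumption of condensed squared-Euclidean RDMs, for which a non-negatively weighted sum of component RDMs equals the RDM of the concatenated, rescaled feature sets. The function name and shapes are illustrative; in the actual analysis the weights were fitted on one image set and evaluated on another.

```python
import numpy as np
from scipy.optimize import nnls

def fit_rdm_weights(component_rdms, target_rdm):
    """Non-negative least squares: find one weight per component RDM
    (e.g., one per feature set plus one per SVM discriminant) so that
    their weighted sum best approximates the target brain RDM.
    RDMs are condensed (1-D) vectors of pairwise dissimilarities."""
    X = np.stack(component_rdms, axis=1)    # (n_pairs, n_components)
    weights, _residual = nnls(X, target_rdm)
    return weights

# Illustrative use: fit the weights on a training image set, then
# evaluate the weighted-sum RDM against the brain RDM on held-out images.
```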
You can see that it gets the tight human-face cluster just about right, but it doesn't get any of the other obvious categorical features. Here again, for comparison, are the human IT and monkey IT matrices. It misses a lot of the categorical divisions: it doesn't get the animal-face cluster, where the faces are much more visually dissimilar, and it doesn't place the human and animal faces together in a single cluster. So this also didn't work, suggesting that even combinations of computer vision features cannot explain IT.

This was around 2012, when something happened in computer vision: neural networks overtook hand-engineered features at computer vision. Just as a quick reminder, neural networks have a long history, starting at least as early as the 1940s with McCulloch and Pitts's binary neurons. In the 60s there was a lot of talk about perceptrons and their abilities and limitations. In the 80s there was the big parallel-distributed-processing revolution in cognitive science, with Rumelhart and McClelland. But in the 90s neural networks lost steam. They didn't work as well on real-world problems, and computer scientists largely lost faith in this modeling strategy; they thought other, shallower machine learning techniques such as support vector machines had better mathematical theory, fewer problems with training-time complexity, and worked better in practical applications, too. More recently, though, there have been breakthroughs with deep learning. In the mid-2000s, Hinton, Bengio, and LeCun, researchers who believed in the usefulness and ultimate superiority of deep hierarchies, kept at it and solved the technical problems. It turned out that the limitations people had faced were not fundamental; they were just hurdles to be overcome, and they were overcome in this period. In the last five to eight years, with growing computing power and large labeled data sets, this has led to major advances in computer vision and in many other AI applications, using both feed-forward and recurrent neural nets. Increasingly these networks appear in technological applications; they're invading our cell phones, for computer vision, language understanding, and even translation, at higher semantic tasks.

Here we tested a particular kind of deep neural network, a deep convolutional neural net. How does that work? It has little filters, which look like Gabor filters here. It takes a small local filter and convolves the entire image with it, producing a spatial map in which each unit detects a feature of that shape at that location in the image. This filtering operation is repeated for different filters, so you get multiple maps of the image that highlight different features. Then a static nonlinearity is applied, often a rectified linear activation function, where the output of the filter is set to zero if it's negative and otherwise passed through. That's often followed by some kind of pooling, either local max or average pooling, and local normalization. These steps together are considered a layer, and there are multiple layers of this type. In the network I'm going to show you there are five convolutional layers, followed by a couple of fully connected layers, but people have since moved on to much deeper networks, up to between 100 and 200 layers. So here are the results for the deep supervised convolutional network trained by Krizhevsky et al.
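Here is what one such layer looks like as a minimal PyTorch sketch. The channel counts, kernel sizes, and input size are illustrative, not the exact configuration of the Krizhevsky et al. network.

```python
import torch
import torch.nn as nn

# One "layer" in the sense used in the talk: convolution (many small
# filters slid across the image), a rectified-linear nonlinearity,
# local max pooling, and local response normalization.
layer = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=11, stride=4),
    nn.ReLU(),                                 # zero out negative filter outputs
    nn.MaxPool2d(kernel_size=3, stride=2),     # local max pooling
    nn.LocalResponseNorm(size=5),              # local normalization
)

image_batch = torch.randn(1, 3, 224, 224)      # one RGB image
feature_maps = layer(image_batch)              # one spatial map per filter
```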
This is the network that convinced people in computer vision to switch gears and start using neural nets, and there's been no looking back in computer vision since. When we look at the first layer, it has very low performance at explaining the IT representational geometry. It is significant, but it's far from the noise ceiling, so it's clearly not a good model of IT.

To give you a bit more detail on the noise ceiling: the upper bound is the highest accuracy any model can achieve, given that what I'm plotting is the mean correlation across subjects and all the subjects are different. There is some RDM, at the center of the cloud of points corresponding to the single subjects, that has the maximum mean correlation, and no model, no matter what RDM it predicts, can exceed that. So that's a hard upper bound. The lower bound uses the average of the other subjects' RDMs as a model; that's basically saying, let's use all the other subjects' data as our stand-in for the true model. Because that's a noisy estimate, it slightly underestimates the performance of the true model, so it gives us a lower-bound estimate.

Here, for comparison, is the performance range of the computer vision features. As we go up the layers, we get into the range of the computer vision features, and then we get significantly better in the higher layers of the network, which are significantly better than the early representations, but we're still some way from the noise ceiling. So again we asked: can we play the same game we tried for the computer vision features, and remix and reweight the features of this very rich representational space in the deep net to better explain the IT representation? We followed exactly the same process: we trained three SVM discriminants on separate sets of images to emphasize the relevant categorical divisions, and then we used non-negative least squares to train one weight for each layer and one weight for each SVM discriminant. When we did this, for the first time we got a matrix that looked, qualitatively, very much like what we've been looking at in inferior temporal cortex for a long time. Here again, for comparison, are the human IT and monkey IT matrices. You see that the major categorical divisions are now right: the animal-face cluster, the overall face cluster, the animate/inanimate distinction. And within these categories, too, there's a lot of correlation between the matrices. When we look at that model, it still doesn't quite reach the noise ceiling. Here we don't have a lower bound, because this model is overfitted to a set of subjects, so the lower bound doesn't apply, but even if it did, it would be slightly below the ceiling. So there's still unexplained variance, but this does better than anything we've ever seen, and significantly so.

Just as a methodological note: all of these models explain significant variance, and that's essentially asking whether there's mutual information between stimuli and responses. Of course there is, and that's usually what we go for when we use decoding. From my perspective, decoding is not very useful for learning about computational mechanism. The important stuff is up here; this is what tells us something about the computational mechanism: the comparison of the performances of the different models to the noise ceiling, and the comparisons between alternative computational models.
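A simplified sketch of the two noise-ceiling bounds just described, assuming one condensed RDM per subject and using the group-mean RDM as an approximation to the best-possible model RDM (the toolbox refines this for rank correlations).

```python
import numpy as np
from scipy.stats import spearmanr

def noise_ceiling(subject_rdms):
    """subject_rdms: (n_subjects, n_pairs) condensed RDMs, one per subject.
    Upper bound: each subject's RDM correlated with the group mean,
    which includes that subject, so no model RDM can do better on average.
    Lower bound: each subject correlated with the mean of the *other*
    subjects' RDMs, a noisy stand-in for the true model that therefore
    slightly underestimates the true model's performance."""
    n = subject_rdms.shape[0]
    mean_rdm = subject_rdms.mean(axis=0)
    upper = np.mean([spearmanr(s, mean_rdm).correlation
                     for s in subject_rdms])
    lower = np.mean([spearmanr(subject_rdms[i],
                               np.delete(subject_rdms, i, axis=0).mean(axis=0)
                               ).correlation
                     for i in range(n)])
    return lower, upper
```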
In the last minute, I want to show you briefly that we can also use these models to predict behavioral responses, because ultimately, of course, we want to explain behavior as well. This is work by my postdoc Ian Charest, who's now a lecturer at the University of Birmingham. Ian had subjects categorize our image set along several categorical divisions; I'm only going to show you animate versus inanimate. For every image, he gets a reaction time in this animate/inanimate categorization task (I'm showing you the subject average here; we're also looking at individual-subject analyses). Our reasoning is that perhaps categorization works by setting up a readout filter for animate/inanimate categorization from the IT representation, and then accumulating evidence through that readout filter somewhere in the frontal lobe. For some images the evidence on this readout dimension might be strong, leading to fast accumulation of evidence; for other images it might be weak, so it takes longer for the evidence to accumulate and the reaction to occur. When we have a representational space, which could be one measured with fMRI or one inside a computational model, we can fit this decision boundary and get a decision value for each image, which enables us to plot the decision value against the reaction time. We expect that when an image is far from the decision boundary, the evidence should accumulate rapidly and the reaction time should be short. Here I'm showing you this for human IT measured with fMRI, and we see that there is a correlation: the particular images that are far from the decision boundary in the human IT representational space are the images that subjects respond to more rapidly, and that's a highly significant correlation. We can play the same game for the deep convolutional network, where we fitted a linear SVM readout on the final fully connected layer of the network. This works even better, possibly because the quantities in the deep convolutional network are not noisy measurements; they're computed directly from the images. We can predict the reaction times for individual images with even greater accuracy than from the subjects' IT representation.
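As a sketch of this readout analysis: fit a linear SVM on the representational patterns, take each image's distance from the decision boundary as its decision value, and correlate that with per-image reaction times (the prediction being a negative correlation: far from the boundary, fast response). Names and data here are hypothetical, and in practice the readout would be trained on images separate from those whose reaction times are being predicted.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.svm import LinearSVC

def decision_values_vs_rt(patterns, labels, reaction_times):
    """patterns: (n_images, n_features) IT patterns or model-layer activations.
    labels: binary animate/inanimate labels. reaction_times: per-image mean RTs.
    Fits a linear SVM readout and correlates each image's distance from
    the decision boundary with its reaction time."""
    svm = LinearSVC(C=1.0).fit(patterns, labels)
    dvals = np.abs(svm.decision_function(patterns))  # distance from boundary
    return spearmanr(dvals, reaction_times)
```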
To sum up, there's an emerging literature using deep nets to explain brain computations. It started in 2014 with these three papers; the other two are from Jim DiCarlo's great lab at MIT. Last year there was a very good paper from Marcel van Gerven's group, and there are a number of preprints floating around; it's a very fast-moving field. I want to leave you with one central message, which I think is the most important point of my talk, about the novel feature of this literature: the models perform the tasks, in this case object recognition under natural conditions, while explaining brain activity as well as behavioral responses. Thank you very much.

Thank you, Niko, for a very elegant presentation and beautiful slides. Questions for Niko? Perhaps I can start off and ask you one question, Niko, about the neuroinformatics challenges attending these kinds of approaches. How much are you hitting ceilings in terms of computational capacity?

Computational capacity is not our main limitation at the moment. We bought a number of GPUs and they work very well for us. One interesting thing to discuss technologically is HPC versus GPUs, and what the trade-off is there. For us, so far, it has looked, for these applications, like GPUs are the way to go, for the near future at least. More generally, I think we need to move away from the cottage-industry style of doing neuroscience. In my field we always do essentially the same experiment: we show some visual stimuli, we measure responses, and then we analyze them multivariately. More and more people in my particular field are realizing that this is quite a standard experiment, so it's somewhat standardizable, and the data can be made shareable. We're thinking about how to move forward with this, so that not every lab has to do all of the steps, acquiring the data, analyzing the data, doing the modeling, and drawing the conclusions, but we can come together as a larger group and pursue this collaboratively.

Very nice talk. I want to ask you: most of the computational models that we use are feed-forward, right? We already know that the visual system is full of recurrent connections. So can we really get to the high-level message that you're giving here? Can we really explain something about the brain if we ignore all of those recurrent connections?

The answer to the question as you posed it, can we explain something about the brain, is yes. I think these networks capture something about what's going on in the brain, namely the feed-forward sweep, so they go some way toward explaining the feed-forward sweep and behaviorally rapid categorization. However, recurrent processing is indeed our obsession, in my lab in particular, and we're training recurrent neural networks as well. This new engineering literature is very much driven by exactly this intuition and has had great successes with recurrent neural network models, for example in machine translation and a number of other applications. The earliest successful model was the long short-term memory, but now there are gated recurrent units and all kinds of more complex memory mechanisms, like the neural Turing machine, being explored in the engineering literature. This, I would say, is the most exciting aspect of this whole new literature. So I very much agree with you that we need to incorporate that, and that is the grand challenge, really.

Question: do you think these deep net models say anything about the learning of object categories in the brain?

That's not how we use them, because I don't study plasticity; I study perception. I'm interested in how these computations work when you see an object, in the first second after the object becomes visible to you; that's my object of study. So to me, the way these models are trained, backpropagation, might as well be just a hack for setting all these parameters and transferring into the model the rich world knowledge that an intelligent mechanism needs to perform the task. However, there's also a very interesting literature on the biological implications of different learning rules, including more biologically plausible learning rules such as STDP and Hebbian learning, and unsupervised learning techniques. But even backpropagation: there's this old debate, is backpropagation biologically plausible?
There are arguments on both sides; this is being revisited, and my intuition is that it would be very short-sighted to dismiss it too easily. It's a very complex and interesting discussion. There is a great review I would recommend, by Marblestone, Wayne, and Kording, on how the brain might use cost functions, with different cost functions in different parts of the brain, to train itself in a way that is at least functionally similar to backpropagation. This links the engineering literature to the neuroscience literature in a very deep way, I think.

OK, I think we'll call it a stop there. We'll have a lot of opportunities for questions in the discussion section. Once again, thank you very much; it was fascinating.