All right, let's do this. Thanks for having me. So I run a research group at MIT called the Laboratory for Computational Audition, and our long-term goal is to build good predictive models of human hearing. By that I mean we would like to end up with a computer program that can take sound as input and then make all the kinds of judgments that a person can make about sound. We think that if we're successful in that enterprise, it will transform our ability to make people hear better.

From where I sit, the peripheral auditory system is pretty well characterized. We, and many others in the field, work with what are now fairly standard peripheral auditory models, one example of which is shown in this picture. There are lots of variants of these kinds of models, and depending on your purposes you will work with different variants, and there is certainly important development continuing in this space. But most of the time in our lab we are thinking instead about what happens downstream. Over the past five years we have been pretty intensively engaged in asking whether we can obtain better models of the downstream auditory system using machine learning.

This effort has been energized and enabled by the fact that it has now become pretty routine to attain human-level performance on perceptual tasks using artificial neural networks. These are models whose basic ingredients have been around for a very long time, since the 80s or even earlier. But due to developments in computer speed and memory, as well as some tricks that have come about, these kinds of models have started working really well over the past eight or nine years. They involve the repeated application of a bunch of simple operations: filtering or convolution, pooling, normalization. The key to making these models work well is that we now have very effective methods for tweaking the parameters of the models in order to optimize them to classify the input signals. So if you have a big set of speech signals and you have labels for the words, you can adjust the parameters of the model so that it will correctly recognize the speech. Over the last eight years this has produced a complete transformation in the way people do engineering; this is what is known as deep learning.

There are a few different ways in which deep learning might be used in hearing research. One is to build models of auditory behavior. This is a situation where the neural network gets sound as input and produces some kind of behavioral output, typically a classification of the sound, so as to recognize it in some way. This is useful because you can compare the model to human behavior, and I'll tell you more about that later. There's also the intriguing possibility that the model stages might map onto stages in the brain, and in our lab we have pursued this a little bit; this is one example paper. Another potential use is to model part of the auditory system. Here the neural network would get sound as input and would output something that attempts to replicate or predict a neural response. There's an interesting recent paper from Sarah Verhulst's lab where she made a more efficient cochlear model using a neural network, and Nima Mesgarani's lab has also been using this approach to try to predict responses in auditory cortex measured with ECoG.
This kind of enterprise is often limited by the data needed to train the model, but it's nonetheless an interesting direction as well. And finally, there are lots of applications of deep learning for audio transformations, where the neural network takes sound as input and then changes it in some way, to do de-noising, for instance, or source separation. There's a huge engineering literature about this. My lab has done a little bit of number three and a fair bit of number one, and that's what I'm going to focus on in this talk today.

All right, so the question is: can we obtain better models of the auditory system by training systems to perform tasks using these kinds of methods? The general approach we've taken in our lab is to hardwire a model of the cochlea to be faithful to what we know from biology, on the grounds that we know a lot about what happens at that stage, and then try to learn all the subsequent stages of an auditory model with a neural network. The parameters of the model are optimized to perform one or more tasks. The output units of the model are to be interpreted as class probabilities, where each one might correspond to a different word. The result is a candidate model of the auditory system.

So the plan for today is to tell you about some examples of recent successes we've had with neural network models of hearing, as well as to discuss the challenges and opportunities in this domain. Some of the take-home messages are that after training on natural auditory tasks with natural sounds, we find pretty good matches to human behavioral experiments, and I would argue unprecedentedly good matches. This is true in a bunch of domains: speech recognition in noise, sound localization, and pitch perception. I'll show you how this can provide insight into the origins of human behavioral traits. And then I'll conclude by pointing out that the models still have lots of shortcomings that we need to better understand and address.

As an initial example, I'm going to tell you briefly about the very first thing we did in this domain, which was to try to build models that could recognize speech. To do this, we took excerpts of speech and superimposed them on background noise to create a speech-in-noise recognition task. Here is an example stimulus: an excerpt of somebody saying "gross domestic product grew," superimposed on noise from a train station. The task is to identify the word that occurs at the midpoint of the clip, which in this case would be "domestic." Gross domestic product grew. Okay, hopefully you could hear that; somebody said "gross domestic product grew."

Classically, this would have been a pretty hard task in speech recognition, but it's the kind of thing that neural networks are nowadays pretty good at. So we built a model that could do this task. The methods we used are by now pretty standard. We use something called backpropagation in order to learn the weights of the model. Backpropagation is essentially an application of the chain rule from calculus, by which you can compute the partial derivative of a loss function with respect to each parameter of the model, where the loss function is some measure of how well the model is performing the task.
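To make that setup concrete, here is a minimal sketch of the kind of training step involved. This is not the lab's actual code; the layer sizes, input dimensions, and number of word classes are illustrative assumptions, and the "cochleagram" here is just a random placeholder tensor standing in for real speech-in-noise excerpts.

```python
# Minimal sketch: a small convolutional network trained by backpropagation to
# classify the word at the center of a cochleagram-like input. All sizes and
# the number of word classes (500) are illustrative assumptions.
import torch
import torch.nn as nn

class WordClassifier(nn.Module):
    def __init__(self, n_words=500):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=7, stride=2), nn.ReLU(),  # filtering / convolution
            nn.MaxPool2d(3, stride=2),                              # pooling
            nn.BatchNorm2d(32),                                     # normalization
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.BatchNorm2d(64),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(64 * 4 * 4, n_words)  # output units = word class scores

    def forward(self, cochleagram):
        h = self.features(cochleagram)
        return self.classifier(h.flatten(1))

model = WordClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()                  # measure of task performance

# one training step on a batch of (cochleagram, word label) pairs
cochleagrams = torch.randn(16, 1, 171, 400)      # placeholder for real speech-in-noise inputs
labels = torch.randint(0, 500, (16,))            # placeholder word labels
optimizer.zero_grad()
loss = loss_fn(model(cochleagrams), labels)
loss.backward()      # backpropagation: chain rule gives d(loss)/d(parameter)
optimizer.step()     # tweak the parameters to improve task performance
```

The point of the sketch is just that every parameter in the convolutional and fully connected layers gets adjusted by gradients of the classification loss, and nothing else.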
Now the other thing that is very important when you build these kinds of models is that there are a whole bunch of choices that get made with regard to the structure of the model. You can see here that the model is defined by some number of layers, each layer performs some kind of operation, and the operations are defined by a whole bunch of hyperparameters, like the size of the filters or the nature of the pooling. All of those choices can make a big difference in how well the model actually works. So if you're doing something in a new domain, where you can't just take a model that somebody has used for something similar in the past, you typically need to do some kind of search over those hyperparameters, and we do that; there are various procedures we employ to optimize them to some extent. The other thing I'll say is that we are typically working in domains where the phenomena are defined over relatively short time scales, so the inputs to the model are relatively short sounds, on the order of one or two seconds, and the model gets the whole sound at once. We are typically neglecting attention over time and the effects of memory, which I know are important and interesting things that we should try to model.

Our initial work in this area was led by Alex Kell, who was a grad student in the lab and is now a postdoc at Columbia, and Dan Yamins, who was a postdoc in our department and is now a professor at Stanford. What I'm going to show you to start out with is a behavioral comparison of human listeners and the neural network model on the same experiment. This is an experiment where people heard speech in noise, with different types of background noise and different signal-to-noise ratios. This part was led by Erica Shook, who was an undergrad in our lab and is now a grad student at Columbia.

This is how humans do at the task. The graph plots proportion correct on the y-axis versus signal-to-noise ratio on the x-axis. Unsurprisingly, as the SNR improves, people do better, but you can see that certain kinds of background noise are easier than others: it's a lot easier to recognize speech in music than speech in speech babble. So that's just what humans do. These are the results of testing the model on the same task. There are really two things to take away from this. The first is that the y-axis limits are the same, which means that the model is performing as well as, or maybe a bit better than, humans. The second is that the relative difficulty of the conditions is pretty similar for the model and for humans: the model also does better with speech in music than with speech in speech babble.

One important point that I want to emphasize, and this will be true of everything I show you in this talk, is that the model is not fit to match human behavior. You just optimize the model for the task, and any match that you see to human behavior is a side effect of that optimization; there is no fitting to human data. This was the first result of this nature that really got our attention and made us think that this was worth pursuing. We've since seen analogous results in a bunch of other domains, and one that we're pretty excited about at the moment is sound localization.
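Before moving on, here is a minimal sketch of how such a behavioral comparison can be quantified. The numbers below are random placeholders rather than real human or model data, and the condition list is only illustrative; the idea is simply that proportion correct is computed per condition for both model and humans, and the pattern across conditions is compared, since the model is never fit to the human results directly.

```python
# Minimal sketch of the behavioral comparison; the "human" and "model" numbers
# below are random placeholders, not real data.
import numpy as np

rng = np.random.default_rng(0)
snrs = [-9, -6, -3, 0, 3]                      # dB; illustrative values
noise_types = ["music", "babble", "stationary"]
conditions = [(n, s) for n in noise_types for s in snrs]

# In practice these would come from scoring the trained network and the human
# listeners on the identical speech-in-noise stimuli.
human_pc = {c: rng.uniform(0.3, 0.9) for c in conditions}   # proportion correct
model_pc = {c: rng.uniform(0.3, 0.9) for c in conditions}

# Because the model is never fit to the human data, the comparison is about
# the pattern of difficulty across conditions, e.g. a simple correlation.
r = np.corrcoef([human_pc[c] for c in conditions],
                [model_pc[c] for c in conditions])[0, 1]
print(f"human-model correlation across conditions: {r:.2f}")
```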
So I think everyone in this audience knows the classical story that there are three main types of cues to a sound's location: level differences and time differences between the two ears, as well as spectral cues that vary with the elevation of a sound. This is in all the textbooks, and it's what we all teach in our classes. But in actual real-world environments there is always noise, and there are reflections that arrive from directions other than the source, so localization is a pretty hard problem. The models of sound localization that we typically work with usually can't actually localize sounds in real-world environments; it's a hard engineering problem.

So Andrew Francl, who is a student in our lab, tried to build such a model. He got around the need for lots of labeled data by using a virtual training environment. We took recordings of natural sounds and of noise sources and then used a room acoustic simulator to render the binaural audio that would enter the ears of a person if they were at, for instance, this location, with the sound source here and some noise sources here. That produces stereo audio that gets passed to models of the cochlea, which in turn provide the input to a neural network that has to output the azimuth and elevation of the sound.

We trained this model in the virtual environment, and one of the first questions we wondered about was whether this would be sufficient to give us a model that works in real-world conditions. The way Andrew tested that was by making stereo recordings from the ears of a KEMAR mannequin in our lab space. You can see the mannequin sitting here and the speaker here; he moved the speaker relative to the mannequin so as to get a test set. You can see that these are really not particularly controlled conditions, and that was the point: it's just a regular room, with air conditioning as a noise source and surfaces providing all kinds of reflections. The cool result is that the model transfers really well to this real-world test set. This is actual position on the x-axis and judged position on the y-axis, and you can see that the points fall along the diagonal. So we've now got a model that can actually localize sounds, which is cool.

Now, many of you are aware that spatial hearing and sound localization is an extremely well-studied topic, and there is a really beautiful and rich literature of experiments documenting the characteristics of spatial hearing in humans. The graphs on this slide show a bunch of examples, and if you know spatial hearing you'll be familiar with most of them. I don't have time to go into the details of each one, but what I'm going to show you are the results of taking each of these experiments and running our model on them; that's what's shown at the bottom. If you do a visual comparison of the top and the bottom, hopefully you can see that the model reproduces all of these different characteristics of human spatial hearing, again without being fit to match human data in any way.

Okay, so you may wonder: what's the scientific value of this? We've got a model that can reproduce human behavior, but what do we learn from it?
I mean, in some of these cases, for instance duplex theory, people arguably had pretty strong and reasonable intuitions beforehand about why, for example, people use time differences at low frequencies and level differences at high frequencies. But that kind of intuition is not there for a lot of phenomena in hearing. One of my favorite recent examples is the limitation of sound localization in the presence of concurrent sources. This graph is from an experiment by Zhong and Yost from a few years ago, where human listeners were presented with some number of sound sources played from different speakers in a speaker array; there could be between one and eight concurrent sources. The listeners had to point to the sources, or somehow indicate their locations, and report the overall number. The graph plots the reported number of sources as a function of the actual number of sources. What you can see is that humans are pretty accurate up to about three sources, and thereafter they drastically undershoot. So it seems that sound localization is limited, at least in a lot of conditions, to about three or four concurrent sources. But it's not obvious why that should be the case. You might think there is some human-specific cognitive constraint, like maybe people can only attend to about three things; it's really unclear. The striking result is that Andrew ran his model on the same experiment and got a very similar result: the model, just like humans, can really only localize up to maybe three or four sources. That suggests this is an intrinsic limit related to the available information, having to do, I guess, with the way the cues interfere; it's something fundamental to the problem. So this is a phenomenon that really didn't have a good explanation, at least none that I had ever heard prior to Andrew doing this work, and we now believe it really just falls out of the properties of sound localization itself.

Another behavioral comparison that we have made is in the domain of pitch perception. Here, Mark Saddler and Ray Gonzalez built a model of pitch perception by optimizing a neural network to estimate the fundamental frequency of sounds, in this case trained on speech and music sounds superimposed on noise. Once trained, they asked whether the model would reproduce the key characteristics of human pitch perception. These five graphs are five classic experiments in human pitch perception; if you work in pitch, you'll know each one of them. They each measure the ability to discriminate or estimate fundamental frequency as a function of some stimulus parameter. What I'm going to show you now are the results of running the model on the same set of experiments, and hopefully just by eyeballing it you can tell that qualitatively, and in many cases quantitatively, the model reproduces the properties of human pitch perception. I would argue that this is a pretty significant advance over previous models, in that we're getting human-like behavior in realistic conditions, with accuracy comparable to what we see in humans, and in many cases similar psychophysics across a pretty wide range of experiments, indicating similar use of cues.
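As a rough illustration of how fundamental frequency estimation can be cast as a classification problem of the same form as the word task, here is a small sketch. The F0 range, the number of bins, and the log spacing are assumptions on my part, not the specifics of the actual pitch model.

```python
# Minimal sketch: treating F0 estimation as classification over discrete,
# log-spaced F0 bins. Range (80-1000 Hz) and bin count (100) are illustrative.
import numpy as np

F0_MIN, F0_MAX, N_BINS = 80.0, 1000.0, 100
f0_bin_edges = np.geomspace(F0_MIN, F0_MAX, N_BINS + 1)   # log-spaced bin edges

def f0_to_class(f0_hz):
    """Map a ground-truth F0 (e.g. from the speech/music training corpus) to a class label."""
    return int(np.clip(np.digitize(f0_hz, f0_bin_edges) - 1, 0, N_BINS - 1))

def class_to_f0(class_index):
    """Map the network's predicted class back to an F0 estimate (geometric bin center, Hz)."""
    return float(np.sqrt(f0_bin_edges[class_index] * f0_bin_edges[class_index + 1]))

label = f0_to_class(200.0)            # training target for a 200 Hz sound
print(label, class_to_f0(label))      # round-trip back to an F0 estimate
```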
So this is an indication that human-like behavior very often emerges simply from optimizing for natural tasks. One of the interesting things we think we can do with this is investigate the conditions that produce human-like behavior, and as one example I'm going to show you a result of this nature in pitch perception. I just showed you how this model trained on speech and instrument sounds reproduces all these different results in human hearing. To test whether the learned strategy that is embodied in those psychophysical results is adapted in some way to the natural environment, we instead took the model and trained it on unnatural synthetic tones, in this case tones with high-pass spectra. I'm just going to flip back and forth between two sets of results: this is the model trained on natural sounds, and here is the model trained on synthetic tones. You can see that you get completely different characteristics from a model that, again, is trained to estimate fundamental frequency, but with sounds that have very different statistical properties. This suggests that the characteristics of human hearing depend to a pretty striking extent on the auditory diet that the system is optimized for, either over evolution or over development.

Another related application concerns the long-standing debates over the roles of timing and place information in pitch perception. This is a picture of a model of the auditory nerve responding to a harmonic complex tone. There is place information here, where you see these peaks corresponding to individual harmonics, but there is also phase-locked timing information, at least up to around four kilohertz, that signals the frequencies of the individual components. There have been lots of arguments about the relative importance of timing and place in pitch perception, and we can address this in the context of our model by removing the timing information: by imposing a low-pass cutoff in the hair cell stage of the auditory nerve model that provides the input. You can see that the place information remains intact, but the temporal information is gone; this is an extreme case. These graphs show the psychophysical characteristics of models that were optimized for these different kinds of cochleae. The red curve has what we think is the realistic cutoff of 3,000 hertz and looks pretty human-like, and as you lower that cutoff the graphs look less and less like what you observe in human hearing. This suggests that access to phase locking is actually necessary to get human-like pitch perception out of these models, and that the same is likely true of humans.

All right, so what we're seeing here is that trained neural networks reveal the performance characteristics of task-optimized mechanisms. The big picture is that we think of these models as conceptually similar to ideal observer models. Of course, they're not provably ideal, and that's an important distinction, but on the other hand they're applicable to domains where deriving an ideal observer would be intractable, so they allow us to move into domains where we couldn't have done this before. Another application that we're excited about is using these kinds of models to understand the behavioral deficits associated with hearing impairment.
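Before turning to hearing impairment, to give a sense of how the phase-locking manipulation just described can be implemented, here is a minimal sketch using generic half-wave rectification plus a low-pass filter as a stand-in for the hair-cell stage. The filter order and the specific cutoffs are illustrative, not the exact parameters of the auditory nerve model we used.

```python
# Minimal sketch of removing phase locking at the periphery: low-pass filter
# the half-wave-rectified output of each cochlear channel, with the cutoff as
# the manipulated parameter (e.g. 3000 Hz vs. 50 Hz). Generic stand-in for a
# hair-cell stage, not the actual auditory nerve model.
import numpy as np
from scipy.signal import butter, lfilter

def haircell_stage(cochlear_channels, fs, lowpass_cutoff_hz=3000.0):
    """cochlear_channels: array (n_channels, n_samples) of bandpass-filtered sound."""
    rectified = np.maximum(cochlear_channels, 0.0)            # half-wave rectification
    b, a = butter(2, lowpass_cutoff_hz / (fs / 2), btype="low")
    return lfilter(b, a, rectified, axis=-1)                  # lower cutoff -> less phase locking

fs = 20000
t = np.arange(fs) / fs
channels = np.sin(2 * np.pi * 1200 * t)[None, :]              # toy 1.2 kHz cochlear channel
intact = haircell_stage(channels, fs, 3000.0)                 # fine timing information retained
degraded = haircell_stage(channels, fs, 50.0)                 # only the envelope / place cue survives
```

Networks optimized on the intact versus degraded inputs can then be compared on the same psychophysical experiments, which is the spirit of the manipulation described above.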
So there's going to be a talk in the next session by Mark Saddler, who's a grad student in our lab, and I'm not going to steal his thunder, but I'll give you a little teaser here. What Mark has been trying to do is take the models of normal hearing that we have been developing, where a cochlear model provides input to a deep neural network, and simulate hearing impairment by altering some component of the cochlear model, for instance by eliminating the contribution of the outer hair cells. There are two different kinds of models that Mark has been building. One is what we call static: you take your model of normal hearing and swap in the impaired cochlea. You can maybe think of this as analogous to somebody developing hearing impairment late in life, when the brain is no longer plastic, so they're stuck with the new, impaired cochlea. But there's also a plastic model of impairment, where you retrain the neural network with the impaired input. Mark has been investigating whether either of these kinds of models reproduces the behavioral deficits seen in hearing-impaired listeners, so check out his talk for that.

Now, there are plenty of challenges that you encounter when you use deep learning. One of the big ones, as I have already alluded to, is that supervised training is very data hungry: you typically need hundreds of thousands of labeled training examples, and that is not always obtainable. There has certainly been lots of work to build up corpora of labeled sounds, but for lots of tasks you might be interested in, that might not be available. We've been trying to use virtual training environments as a way to circumvent this, but that's a whole enterprise in itself and sometimes hard to pull off. Also, storage can be expensive; my students are always asking me for more disk space, and that can be a headache. Another challenge is that the whole enterprise is pretty compute intensive. One reason is that if you're working on a new problem, you often need to optimize the model architecture, which in practice means training a whole bunch of different models to find one that works well. So we use a lot of computer time.

There are also some important limitations of deep learning, at least as we currently do it, that are important to keep in mind and interesting to think about. One of them is that the learning is pretty clearly unrealistic. These models require huge amounts of labeled data; humans pretty clearly don't. When a child learns to recognize speech, their parent is not walking around saying "car, car, car, car" a thousand times in the way that the model was trained on. Maybe the child hears it once or twice, and then gets it largely through passive observation. The models also deviate from biological learning in exhibiting what is often called catastrophic forgetting. This refers to the fact that if you train the model on one thing and then train it on a second thing, it typically forgets the first thing, which is pretty different from biological learning. A really nice example of this comes from a classic study from John van Opstal's lab that many of you probably know, which we have revisited with our model of sound localization. In this study they brought people into the lab and first asked them to localize sounds.
So there's this grid of sound locations, in azimuth and elevation, and the black solid lines show that people are pretty accurate when they're just using their normal ears. Then what they did is put plastic molds in the participants' ears and retested them on the localization task. You can see that although localization in azimuth was largely preserved, localization in elevation collapses. This suggests that people have learned to use the particular spectral cues produced by their own ears. We tested our model on the same experiment by swapping in a new set of HRTFs, and when you do this the model reproduces the results of that experiment pretty nicely: with the set of HRTFs it was trained on, the model can localize in elevation, but if you swap in a new set of HRTFs, localization in elevation collapses even though localization in azimuth is largely preserved.

Now, the van Opstal paper had other results as well. One is that they convinced these people to take the molds home and wear them for about a month, and periodically they brought them back into the lab and tested their localization. They found, pretty remarkably, that over time people learned to use their new ears, so localization was restored. That would also work with these kinds of models: if you retrain the neural network model with a new set of HRTFs, it also learns to localize just fine. But there's one more result in that paper which is really amazing: when they had concluded the experiment, they asked the people to remove the plastic molds and retested them immediately, and they found that people had retained the ability to use their original ears to localize sounds. So although they had learned the new set of spectral cues, when they took out the molds it was as if they had retained knowledge of the original cues. And that definitely would not work with these kinds of neural network models, because of this phenomenon of catastrophic forgetting.

Other limitations are that, although we call them neural networks, they're really not very neural. The so-called neurons don't obey Dale's law: in the brain, neurons tend to be either excitatory or inhibitory, whereas these models typically have both positive and negative weights on each individual unit. So they're really not well suited to modeling neural circuits, at least not right now.

And then, finally, there are these really interesting and curious discrepancies with biological systems, particularly for input signals that are derived using knowledge of the model. The most well-known example is what are called adversarial examples. These have been known about for a little while now, and they refer to the fact that the models can be fooled by targeted perturbations that are imperceptible to humans. If you have the model in hand, you can take some signal that is one kind of thing, make a very small perturbation to it, and the model will think it's something completely different. This happens with our models too. Here is a speech signal for which the target word in the middle is "twentieth": "For most of the twentieth century." We've made a very small perturbation to it; it's going to sound exactly the same to you, but the model now thinks that the target word is "track": "For most of the twentieth century." This is a big security concern, and there's lots of interest in it.
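To make the idea of a targeted perturbation concrete, here is a minimal sketch of a single gradient-sign step on an input waveform against a generic classifier. This is a standard textbook-style construction under assumed placeholder shapes and labels, not the specific procedure used to make the example you just heard.

```python
# Minimal sketch of a targeted adversarial perturbation on an audio input:
# one gradient-sign step nudges the waveform toward an incorrect label while
# keeping the change tiny. The model, shapes, and labels are placeholders.
import torch
import torch.nn as nn

def adversarial_step(model, waveform, target_label, epsilon=1e-3):
    """Return a slightly perturbed waveform that the model is more likely to
    classify as target_label; epsilon bounds the size of the change."""
    x = waveform.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), target_label)
    loss.backward()
    # step *down* the loss for the wrong target word, by a tiny signed amount
    return (x - epsilon * x.grad.sign()).detach()

# toy stand-in for a trained word classifier operating on 1-second waveforms
model = nn.Sequential(nn.Flatten(), nn.Linear(20000, 500))
clean = torch.randn(1, 1, 20000)
wrong_word = torch.tensor([42])                        # hypothetical incorrect target class
adversarial = adversarial_step(model, clean, wrong_word)
print(torch.max(torch.abs(adversarial - clean)))       # perturbation stays within epsilon
```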
It's still not really very well understood, and it seems like a pretty significant discrepancy with humans. Another such discrepancy that my lab has been looking into involves what we call model metamers; this is work by Jenelle Feather. It refers to the fact that there are situations where the model will hear wildly different stimuli as the same thing. Here is an example where the speech signal contains the word "program": "It's a job security program that prevents..." And here is a stimulus that to us sounds completely different, in fact it sounds like noise, but the model again thinks that it contains the word "program." This suggests that some of the model's invariances have diverged pretty significantly from those of humans. If you want to learn more about this, there's a paper that we published on it.

Okay, so just to summarize: I've told you a little bit about our work building new models of the auditory system using deep learning on audio tasks. I showed you a bunch of examples of what I think are pretty compelling matches to human behavior with real-world sounds and tasks, as well as matches to many classic psychophysical results. I've argued that this can help us understand hearing impairment, and I would encourage you to check out Mark Saddler's talk in the next session to learn more about that. I showed you some examples of how we can get insight into the origins of behavioral traits by revealing the intrinsic limits of performance under different kinds of conditions. And I concluded by pointing out some of the very significant remaining discrepancies between these models and biological systems, which we think must be addressed if we're going to fully capitalize on all of the exciting potential applications of these models. I just want to acknowledge all the great people in my lab that I have the opportunity to work with: Alex Kell, Andrew Francl, Mark Saddler, Jenelle Feather, Erica Shook, Ray Gonzalez, and Dan Yamins, as well as our collaborators, including Guillaume Leclerc and Aleksander Madry. And thank you for your attention.