Hey everyone, my name is Tim Brochet, and today I'll be presenting a fully computational model of cochlear implant speech perception that my colleagues and I at the University of Cambridge have been working on for the last year and a half. First off, we'd like to thank the organizers of the conference for the invitation to speak and for putting together such a stellar program today.

The primary goal of our lab is to improve cochlear implant speech perception. Typically we do this by trying out a new processing strategy or stimulation technique in a group of CI listeners. But these in-person trials raise a number of issues, especially once we move to the take-home or extended trials that are really necessary for evaluating processing strategies. Because of those issues, it would be very helpful to have a computational model of CI speech perception in which we could trial processing strategies before running experiments in CI listeners. So our research questions are: can we replicate phoneme-level CI speech perception patterns with a computational model, and once we've done that, can we use the model to evaluate information degradation through the cochlear implant processing chain, quantifying the information loss and identifying any bottlenecks?

I'll start with an overview of our computational model. We begin with a basic cochlear implant front end: a microphone and some pre-processing, followed by a fast Fourier transform. We extract the envelope in each frequency band, use those envelopes to modulate a train of interleaved biphasic pulses, and then drive a finite element model with those pulses to calculate voltages at different locations in the cochlea (there's a short code sketch of this front end at the end of this overview). We then use those voltages to activate a biophysical neural model, which generates neurograms showing how neural activity changes over time.

To evaluate speech information transmission through this model, we use an automatic speech recognition neural network, or ASR. Typical ASRs are trained on spectrograms or other similar time-frequency representations; as you might guess, we train ours on our neurograms. Because the neurograms result from the same kind of information degradation that happens in cochlear implants, we expect our neurogram-trained ASR to make use of similar phonemic cues to cochlear implant users, and perhaps to make similar phonemic errors to CI listeners.

This is our finite element model. It's a parametric model based on the average of 30 cochleae, and it includes a large number of anatomical elements. We built the parametric model from the average cochlear spiral, the average cross section, and the taper along the cochlea, and we input the conductivities of the different materials in the cochlea to determine voltages throughout it in response to electrical stimulation. Here are some example measurements we can make from the model: this is the electric field as the site of excitation changes, and this is a voltage spread validation. We measured transimpedance matrices in seven CI users and used them to calculate voltage spread, shown here as dots with error bars; the model predictions are the lines, which all fall roughly within one standard deviation of the CI listener data. We can then use this finite element model to activate a neural model with the voltages we extract from it.
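To make the front end concrete, here is a minimal sketch of the kind of processing just described. The function name, channel count, pulse rate, frame length, and linear band spacing are all illustrative assumptions, not the parameters of our actual processor.

```python
# Minimal sketch of a CI-style front end: short-time FFT -> per-band
# envelopes -> envelopes modulate an interleaved biphasic pulse train.
# All parameters here are illustrative, not our actual processor settings.
import numpy as np
from scipy.signal import stft

def front_end(audio, fs=16000, n_channels=8, pulse_rate=900.0):
    # Short-time Fourier transform of the pre-processed microphone signal.
    f, t, Z = stft(audio, fs=fs, nperseg=128)
    # Group FFT bins into contiguous analysis bands (linear spacing for
    # simplicity; real processors use roughly logarithmic spacing).
    edges = np.linspace(0, len(f), n_channels + 1).astype(int)
    envelopes = np.stack([np.abs(Z[edges[c]:edges[c + 1]]).mean(axis=0)
                          for c in range(n_channels)])
    # Sample each band's envelope at that channel's pulse times; channels
    # are offset in time so no two electrodes fire simultaneously.
    frame_dt = t[1] - t[0]
    pulse_times = np.arange(0.0, t[-1], 1.0 / pulse_rate)
    electrodogram = np.zeros((n_channels, len(pulse_times)))
    for c in range(n_channels):
        offset = c / (n_channels * pulse_rate)  # interleaving offset
        idx = np.clip(((pulse_times + offset) / frame_dt).astype(int),
                      0, envelopes.shape[1] - 1)
        electrodogram[c] = envelopes[c, idx]    # biphasic pulse amplitudes
    return electrodogram  # rows: electrodes, columns: pulse amplitudes
```

Each row of the returned electrodogram would then drive one electrode contact in the finite element model.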
Here in panel A we have a wireframe of the finite element model, and the colors show the neural trajectories; the cross section of this neural tissue gives the trajectory for a single neuron. We extract voltages at each node of Ranvier of these modeled neurons to get a voltage profile along the neuron, calculate the activating function, which is the second spatial derivative of the voltage, and use that to determine whether an action potential is initiated on that neuron (a short sketch of this step appears below). We model 1,500 neurons along the cochlear spiral, and we couple this with phenomenological models of refractoriness, adaptation, and temporal integration to account for some of the temporal dynamics of neurons.

By doing this, we can generate neurograms from electrodograms. We do this for the entire TIMIT database, a corpus of speech recordings that are labeled by phoneme, which makes them really useful for training and testing ASRs.

This is our automatic speech recognition neural network. The first segment of the network, the causal neural network, determines phoneme probabilities from neurogram input frames. The non-causal neural network then adjusts those phoneme predictions based on adjacent frames; this is our way of incorporating context cues into the ASR.

Next, we can compare the raw data between the model and the CI listeners. This is the rawest form of the output from our ASR: the confusion matrix for these different consonants. We can compare it to data from 20 CI listeners, which was kindly shared with us by Gail Donaldson and Heather Kraft, and we can see qualitatively that the errors cluster in similar locations along the diagonal. We'll get to a quantitative analysis of that shortly. We can also look at just the diagonal of those confusion matrices, which gives the phoneme recognition accuracy for the model and for CI listeners. The model captures between-phoneme differences in perceptibility: the consonants that are most difficult for CI listeners to perceive are also the most difficult for the model to recognize. The dotted line is the line of equality and the solid line is the line of best fit; they're quite close, and there's a solid correlation coefficient.

Now, to identify whether the ASR is making use of similar phonemic features and making similar phonemic errors to CI listeners, we need information transmission analysis, which looks in more detail at the transmission of particular stimulus features. The features for consonants are voicing, which describes whether or not the vocal folds are vibrating during the phoneme; place of articulation, which is the location of the vocal tract constriction; and manner of articulation, which covers the degree of the vocal tract constriction, the nasality, and the tongue and lip movement. The unit of information is the bit, which is a function of the number of categories for a particular feature and the likelihood of each of those categories. For voicing it's pretty intuitive: there are two choices, voiced or unvoiced, so it's a binary variable, and because voiced and unvoiced are fairly evenly distributed it carries one bit. We can calculate place and manner of articulation similarly (sketched below), which gives 2.2 bits for place and 1.9 bits for manner.
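As a concrete illustration of the activating-function step described above, here is a minimal sketch. The function names and the scalar threshold are hypothetical simplifications; the full model combines this step with the phenomenological refractoriness, adaptation, and integration stages.

```python
# Sketch of the activating-function step: given extracellular voltages at a
# modeled neuron's nodes of Ranvier (from the finite element model), the
# discrete second spatial difference approximates d2V/dx2 along the fiber.
import numpy as np

def activating_function(node_voltages, node_spacing):
    """Second spatial derivative of voltage along the fiber."""
    v = np.asarray(node_voltages, dtype=float)
    return (v[:-2] - 2.0 * v[1:-1] + v[2:]) / node_spacing ** 2

def spike_initiated(node_voltages, node_spacing, threshold):
    # A fixed scalar threshold is a strong simplification: the full model
    # also applies refractoriness, adaptation, and temporal integration.
    return bool(np.any(activating_function(node_voltages, node_spacing) > threshold))
```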
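The bits-per-feature numbers just quoted come from the Shannon entropy of each feature's category distribution over the consonant set. A minimal sketch, using placeholder category counts rather than our actual consonant inventory:

```python
# Information in a feature = entropy of its category distribution (bits).
import numpy as np

def feature_bits(category_counts):
    p = np.asarray(category_counts, dtype=float)
    p = p / p.sum()                         # category probabilities
    return float(-(p * np.log2(p)).sum())   # Shannon entropy in bits

# Two evenly distributed voicing categories give exactly 1 bit; place and
# manner have more, unevenly filled categories, hence ~2.2 and ~1.9 bits.
print(feature_bits([8, 8]))  # -> 1.0
```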
Now, the way these features are reflected in the acoustic signal is that voicing determines the pitch and the spectral weighting of a phoneme, place of articulation determines the temporal fine structure, and manner of articulation determines the envelope. From this we might hypothesize that manner of articulation and voicing will be transmitted better than place of articulation.

We'll look at the CI listener data first. This is again the Donaldson and Kraft data, and it shows that voicing and manner have the highest proportion correct, while place has a much lower proportion correct. In terms of which cues are prioritized for the recognition of consonants, though, we can't say this means the voicing cue is more important: the transmitted information divided by the total information shows that the manner and place cues are still the most important for CI listeners. We can also evaluate how well the cochlear implant transmits a particular feature by dividing the transmitted information by the input information for that feature, and we see that manner has the highest percentage of bits transmitted, compared to the voicing and place cues (a small sketch of this calculation appears below).

Next, we compare the ASR results to the CI listener data. For the ASR we have a healthy condition and then different levels of neural degeneration. The ASR results follow a very similar pattern to the CI listeners: a high proportion correct for the voicing and manner cues compared to the place cue, transmitted-over-total information highest for the place and manner cues, and transmitted-over-input clearly highest for the manner feature. We ran a paired t-test for each of these measures and found no statistically significant difference between the ASR and the CI listeners in terms of information transmission. This supports using the model to evaluate a processing strategy, at least prior to studies in cochlear implant users.

Next, we wanted to look at information transmission at different locations along the CI processing chain. We trained the ASR on spectrograms, on electrodograms, and on peripheral neurograms with the causal network, which does not include context cues, and then with the non-causal network, which adds context cues. There is a small degradation of information from spectrogram to electrodogram, but the real bottleneck is at the electrode-neural interface, which is consistent with many studies in CI listeners. Some of that information is then recovered by the non-causal network with its context cues. We can also look at how specific features degrade at different locations along the processing chain: the place feature loses the most information through the chain, the manner and voicing cues lose similar amounts to each other, and for all of these features the electrode-neural interface is the main bottleneck.
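To make the information transmission analysis concrete, here is a minimal sketch of the Miller-and-Nicely-style calculation behind the transmitted-over-input numbers above: collapse the consonant confusion matrix by a feature's categories, then compute the mutual information between the presented and perceived category. The function name and interface are assumptions for illustration.

```python
# Sketch of information transmission analysis over a confusion matrix.
import numpy as np

def transmitted_info(confusions, labels):
    """confusions[i, j]: count of stimulus i heard as response j;
    labels[i]: feature category of consonant i (e.g. 'voiced')."""
    cats = sorted(set(labels))
    k = len(cats)
    m = np.zeros((k, k))
    # Collapse the consonant-level matrix to feature categories.
    for i, li in enumerate(labels):
        for j, lj in enumerate(labels):
            m[cats.index(li), cats.index(lj)] += confusions[i][j]
    p = m / m.sum()
    px = p.sum(axis=1, keepdims=True)   # input (stimulus) marginal
    py = p.sum(axis=0, keepdims=True)   # output (response) marginal
    nz = p > 0
    T = (p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum()  # mutual information
    H_in = -(px[px > 0] * np.log2(px[px > 0])).sum()    # input entropy
    return T, T / H_in  # bits transmitted, transmitted-over-input ratio
```

Dividing each feature's transmitted bits by the total transmitted across features would give the transmitted-over-total comparison shown in the slides.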
So in conclusion: the model predicts between-consonant differences in recognition accuracy, and errors tend to cluster around the manner of articulation. There were no significant differences between the model and the CI user data in the information transmission analysis. The manner and place features contain the most information, while the manner and voicing cues were the ones best transmitted by the CI, and we identified a bottleneck at the electrode-neural interface.

We are now using the model to replicate studies in CI listeners. For example, we were able to replicate some studies on the number of active channels quite well, and some studies on site selection strategies as well, and we're looking to replicate others. We're at a really exciting point where we want to start trying out all sorts of new processing strategies and new types of hardware in our finite element model, so we're looking forward to collaborating with groups that could use our model, where we can discuss new processing strategies and things like that. I'm really looking forward to your feedback.

Before I finish, I'd like to acknowledge the lab that I'm working with and all the people who have contributed to this project. Yosef Schlittenlocker helped out with the ASR. Ewan Roberts has done a lot of work on the finite element model. Tobias Göring also helped with the training and testing of the neural networks. Chen Jiang helped with the physical models that we used to validate the finite element model. Debbie Vickers helped with the information transmission analysis. And Manahar Bantz is leading the whole laboratory. Thanks a lot for sticking around and listening to our project; I'm looking forward to your questions and comments.