The first artificial neural network was designed to serve as a bridge between biophysics and psychology. Frank Rosenblatt called it the perceptron. Nowadays, machine learning can do much more than simple neural networks. In this talk, I will share with you three different strategies to use machine learning for hearing models: machine learning instead of a hearing model, machine learning for a hearing model, and machine learning as a hearing model.

Two years ago, we were the first to use knowledge distillation for a perceptual model. What is knowledge distillation? It was originally developed to replace a complex ensemble of neural networks with a single neural network. The ensemble acts as a teacher and the single network as a student. The teacher is accurate but slow, and the student shall be just as accurate but also fast. Here we have the Cambridge loudness model. Like many good perceptual models, it is accurate but complex and slow. It considers the transfer function through the outer and middle ear, calculates excitation patterns along the cochlea, and covers several other aspects of loudness. Most of the computation time is spent on the excitation patterns and on converting them to loudness. But all that this block in the middle basically does is convert the spectrum to a single value. It's nothing but a non-linear regression from a high-dimensional input space, and that is something neural networks are very good at and extremely efficient. You can use this strategy for other hearing models as well; they just need to regress from one or two spectra to something else, for example speech intelligibility.

We wanted the neural network to be simple so that it is fast. We used a multi-layer perceptron with only three hidden layers, and each of them had 150 units. As activation functions we chose rectified linear units. So all parts of this network are simple, but they are fast and well suited to regression tasks. If we allow parallel processing, we can predict 24 hours of loudness within minutes with that neural network. That's a huge improvement over the reference model, which took up to 50 days for the same task. But even with a batch size of one we are still much faster than real time. We could probably already deploy it in a hearing aid or another small device, and with hardware being developed specifically for AI it will only get better. By comparison, it's quite unlikely that somebody will develop hardware specifically for excitation patterns.

In terms of accuracy, we think that we have a good approximation if our errors are less than one decibel. We computed the loudness for about 2 million spectra with the original loudness model in order to provide target values. After that, we trained the neural network on speech and on artificial sounds. We evaluated the predictions of the neural network for various sounds like aircraft, cats, rain, other environmental sounds, and music, but also speech, because it is so important for our applications. So we evaluated it on several sounds that it had not seen before. In terms of levels, most of the speech sounds and environmental sounds had loudness levels around 60 to 80 phon; quieter segments did occur, but less often. We set the music to a higher level in order to see how well the network generalizes to those higher levels.
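Before the results, here is a minimal sketch of what such a distillation setup could look like in code. It follows the architecture just described, three hidden layers of 150 rectified linear units regressing a spectrum onto a single loudness value; the input dimension, optimizer, and training details are illustrative assumptions rather than the exact configuration we used.

    # Minimal sketch (PyTorch) of a small "student" MLP distilled from a slow loudness model.
    # The input dimension (128 spectral bins here), the optimizer and the learning rate are
    # illustrative assumptions, not the exact configuration of the Cambridge loudness model setup.
    import torch
    import torch.nn as nn

    class LoudnessStudent(nn.Module):
        def __init__(self, n_bins=128, n_hidden=150):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_bins, n_hidden), nn.ReLU(),    # three hidden layers of 150 units
                nn.Linear(n_hidden, n_hidden), nn.ReLU(),  # with rectified linear activations
                nn.Linear(n_hidden, n_hidden), nn.ReLU(),
                nn.Linear(n_hidden, 1),                    # single output: loudness of this spectrum
            )

        def forward(self, spectrum):
            return self.net(spectrum)

    # Knowledge distillation: the slow reference model provides the target loudness values.
    student = LoudnessStudent()
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    def train_step(spectra, teacher_loudness):
        """One gradient step on a batch of (spectrum, teacher loudness) pairs."""
        optimizer.zero_grad()
        loss = loss_fn(student(spectra).squeeze(-1), teacher_loudness)
        loss.backward()
        optimizer.step()
        return loss.item()

The key point is only that the slow reference model supplies the targets; the student network itself is ordinary and small.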
And here are the results. Each panel shows more than 100,000 data points, so you mainly see the outliers. The top row shows the results for clean speech and for speech of lesser quality, both from the LibriSpeech corpus. The bottom row shows the results for the environmental sounds and for music. The outliers are mainly at the lower levels, which occurred less often during training and which are probably less important for the applications. The root mean square error is less than half a phon for each category, which is pretty good and corresponds to considerably less than the one decibel that we had set as a target.

In audiology, we are not primarily interested in a model of the average listener. What we want to know, and what we need to know, are the individual parameters of a hearing-impaired person, so that we can use that personalized hearing model for our applications. Active learning provides mechanisms to determine those individual parameters efficiently. A simple model of a hearing impairment is the audiometric threshold, and that is probably the model that is used most often in clinical practice. More sophisticated models take into account things like loudness compression, also as a function of frequency, hair cell loss, auditory filter shapes, or edge frequencies of dead regions. I will talk about dead regions because that test is an interesting example. A dead region is a region in the cochlea where all inner hair cells have died or were never present. Its edge frequency is more complex to determine than a threshold, because the responses do not directly give information about the model parameters. On the bottom right you see three psychophysical tuning curves for three example sets of model parameters. There are masker levels and frequencies that are a good choice to discriminate between those curves, so we should test there. And there are also points where the curves overlap, so if we test there we don't get much information.

The task in a dead-region test is to tell whether a tone of fixed frequency and fixed level has been heard or not. We know that the subject can hear the tone in quiet, but we mask it with a band-pass noise of fixed width. Given the sound parameters x, which are the masker level and the masker frequency, and the model parameters theta, which are the edge frequency and the broadening of the auditory filters at that place, we can calculate the probability that the tone is heard. In each trial, we do not just want to query around 50% detection probability; we want to present the masker level and masker frequency that are most informative about the model parameters. We choose that based on mutual information. Mutual information is the uncertainty of the response minus the expected uncertainty when the model parameters are known. So we want to probe around detection probabilities of 50%, but only if the uncertainty stems from a disagreement between model parameters that we consider equally likely before we obtain the response. If we already know that the threshold lies at a certain point, we don't want to query there, because we want to discriminate between candidate models and not flip a coin. After we obtain a response, we update our belief about the model parameters, and we repeat the procedure. We start with some prior belief about the model parameters, which you see on the top left. In our case we apply a uniform prior across the model parameters, the yellow area. Based on the high signal frequency and an audiogram that we had taken before, we can already exclude lower frequencies as candidate edge frequencies.
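As an illustration of that loop, here is a minimal sketch of the procedure on a discrete parameter grid: a prior over the model parameters, selection of the most informative masker by mutual information, and a Bayesian update after each response. The psychometric function and the grids are hypothetical placeholders, not our actual dead-region model.

    # Minimal sketch (NumPy) of the Bayesian active-learning loop on a discrete grid.
    # p_heard is a hypothetical placeholder; a real dead-region test would derive the
    # detection probability from the auditory-filter model with parameters theta.
    import numpy as np

    def p_heard(masker_level, masker_freq, edge_freq, broadening):
        """Hypothetical probability that the fixed tone is heard under this masker."""
        protection = (masker_freq - edge_freq) / (broadening * 100.0)  # placeholder relationship
        return 1.0 / (1.0 + np.exp(protection - masker_level / 20.0))

    # Grids over model parameters theta (edge frequency, filter broadening) and candidate stimuli x
    edge_freqs = np.linspace(500.0, 4000.0, 50)        # Hz
    broadenings = np.linspace(1.0, 8.0, 20)
    EF, BR = np.meshgrid(edge_freqs, broadenings, indexing="ij")
    posterior = np.full(EF.shape, 1.0 / EF.size)       # uniform prior over theta

    candidates = [(lvl, frq) for lvl in range(30, 90, 5) for frq in range(250, 4000, 250)]

    def entropy(p):
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

    def most_informative(posterior):
        """Pick the masker whose response shares maximal mutual information with theta."""
        best, best_mi = None, -np.inf
        for lvl, frq in candidates:
            p_theta = p_heard(lvl, frq, EF, BR)        # P(heard | x, theta)
            p_marg = np.sum(posterior * p_theta)       # P(heard | x), averaged over our belief
            mi = entropy(p_marg) - np.sum(posterior * entropy(p_theta))
            if mi > best_mi:
                best, best_mi = (lvl, frq), mi
        return best

    def update(posterior, lvl, frq, heard):
        """Bayesian update of the belief over theta after observing one response."""
        likelihood = p_heard(lvl, frq, EF, BR)
        likelihood = likelihood if heard else 1.0 - likelihood
        posterior = posterior * likelihood
        return posterior / posterior.sum()

In a real test, p_heard would come from the auditory-filter model itself rather than from this placeholder, but the selection and update steps keep the same structure.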
The choice of a prior bears some ethical considerations, too. We could choose an informative prior and reduce the average testing time further, but this could disadvantage those with an uncommon hearing loss. For scientific experiments we like a uniform prior because it does not introduce any bias. For clinical tests we need to make sure that outliers can still be tested within a reasonable testing time, so we must not make a rare hearing loss too unlikely in our prior, even if healthcare systems and insurance companies mainly care about average testing time. The middle panel gives our current belief about the expected detection probabilities for a given masker, integrated across model probabilities, and the right panel shows the mutual information. So for the first trial, any masker level and frequency in the yellow-orange area would be a good choice. After 30 steps, shown on the bottom right, we are quite confident that the edge frequency is around 13 Cams, which is around 700 hertz. It works even better in simulations, but most importantly, it worked very well for our 14 human subjects. The three open symbols for the three runs are always close to each other, which shows that the method is consistent. The three open symbols are also close to the method that we chose as a reference, and when they were not close, other methods were not close either. That indicates that our method is both fast and accurate.

Third part: machine learning as a hearing model. We are not the first to use automatic speech recognition for this purpose, but we made some valuable contributions. This is what an off-the-shelf system looks like: you install Linux, you feed your sounds into a black box, and you get a score. In our approach, we wanted to shed some light into the black box and connect several blocks that are set up inspired by the physiology of our brain. We wanted a meaningful output in terms of phonemes; that is why we used the TIMIT database. We want phonemes because, when optimizing a signal processing strategy of a cochlear implant, hearing aid, or hearable in noise, we would not know what to do if we were just told the word error rate. But we know what to improve, and where, when we are told that, for example, s and f are confused. That is why we work with confusion matrices and measures like mutual information. You can download our ASR system from my GitHub; it can be run on any platform in your usual Python environment, and it is free.

Let's assume we want to predict the phoneme at a certain point in time, indicated by the orange line. First, we transform the signal to a spectral representation, just like our cochlea does. The next step that happens in our brain is the accumulation of auditory information across time. We model that by a causal neural network that converts the spectrogram before the current time step into probabilities for the current phoneme. It does not use any information from the future; it just accumulates information that happened before. We do this for every time step to get this representation of phoneme probabilities over time. The yellow bit just before the current time step indicates a high probability for t, and we are transitioning to something that could be u or e; it is not absolutely certain directly after the orange line. When decoding a longer sequence, we use our memory. When asked for the phoneme at the orange line, you would probably not only look to its left but also to its right. We model this by a second neural network that is centered at the current time step and that improves our phoneme predictions. Again, we get a representation of phoneme probabilities over time.
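To make the two stages concrete, here is a minimal sketch of this causal-plus-context idea. The layer sizes, the number of phoneme classes, and the spectrogram dimensions are illustrative assumptions; the point is only that the first network is strictly causal and the second one looks both left and right of the current frame.

    # Minimal sketch (PyTorch) of the two-stage idea: a causal GRU accumulates the spectrogram
    # up to the current frame, and a second network centred on the current frame refines the
    # phoneme probabilities using left and right context. All sizes are illustrative assumptions.
    import torch
    import torch.nn as nn

    class CausalPhonemeNet(nn.Module):
        """Spectrogram frames -> per-frame phoneme probabilities, using only the past."""
        def __init__(self, n_bins=40, n_hidden=128, n_phonemes=39):
            super().__init__()
            self.gru = nn.GRU(n_bins, n_hidden, num_layers=2, batch_first=True)  # unidirectional = causal
            self.out = nn.Linear(n_hidden, n_phonemes)

        def forward(self, spectrogram):              # (batch, time, n_bins)
            h, _ = self.gru(spectrogram)
            return self.out(h).softmax(dim=-1)       # (batch, time, n_phonemes)

    class ContextPhonemeNet(nn.Module):
        """Refines the causal predictions with a window centred on the current frame."""
        def __init__(self, n_phonemes=39, n_hidden=128):
            super().__init__()
            self.gru = nn.GRU(n_phonemes, n_hidden, num_layers=2,
                              batch_first=True, bidirectional=True)  # looks left and right
            self.out = nn.Linear(2 * n_hidden, n_phonemes)

        def forward(self, causal_probs):             # (batch, time, n_phonemes)
            h, _ = self.gru(causal_probs)
            return self.out(h).softmax(dim=-1)

    spec = torch.randn(1, 300, 40)                   # e.g. 300 frames of a 40-bin spectrogram
    refined = ContextPhonemeNet()(CausalPhonemeNet()(spec))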
To transform those phoneme probabilities into a sequence, we use a hidden Markov model, and the sequence that I randomly chose says "terrible to witness". In our whole system, the speech recognition part is only the second part. I need to emphasize that it works on any input that looks like a spectrogram. It can be the output of a first-principles hearing model like the Moore model, the Bruce model, or the Zwicker model, and Manohar's group has even developed a finite element model of the cochlea, so that we can predict speech recognition for electrical stimulation 100% computationally.

Just briefly about the architecture of the networks: both networks have two layers with gated recurrent units. The network does some average pooling at the start so that it can have a bigger window size, and it predicts the preceding and next phonemes too, so that we can make the hidden layers consider the context. We achieve a phoneme error rate on TIMIT that is comparable with similar architectures. We didn't optimize it further with more regularization because we wanted to focus on reasonably quick training. We are also going to replace the hidden Markov model with an attention layer; if you are a student who is looking for a remote summer internship and are interested, then please get in touch.

Here is the accuracy after the causal neural network for different input window sizes. It already works quite well when we have only 5 time steps at a step size of 2 ms, that is, a 10 ms window, but of course it improves when we consider more of the signal. When we process those outputs through the non-causal neural network with a window size of 600 ms, the accuracy becomes the same for all window sizes of the causal neural network. That was to be expected, because the output of the causal network with small input windows still provides useful input features, and they obviously get corrected by the larger context. We interpret the difference between the two performances as a measure of listening effort, that is, the amount of work the memory needs to do. We can also manipulate the processing abilities in the temporal and spectral domain. If we use only a fraction of the input nodes, which has the same effect as using a band-pass filter, we confirm experiments from the literature that frequencies between 1 and 2 kHz are most important. Maybe more interesting is a comparison between the ASR predictions and data from actual cochlear implant users. I don't want to go into more detail here and will just emphasize that the green bars for the CI listeners are similar to the bars produced by the ASR. Tim will present more about this and our whole pipeline to analyze cochlear implant performance in a few minutes.

Let me conclude with our contributions to machine learning for hearing models so far. We were the first to use knowledge distillation to speed up a hearing model. We did so successfully for loudness, and others have done so for speech quality, speech intelligibility, and neural activation patterns. We developed Bayesian active learning tests for dead regions, notched-noise tests, and equal-loudness contours, which allow us to determine individual parameters of complex and sophisticated hearing models. Furthermore, we developed an easy-to-use ASR system, and we use it to optimize the signal processing strategies in cochlear implants and other hearing devices. More will follow; I'm already looking forward to seeing you again.
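As a small addendum to the window-size and band-limiting analyses above, here is a toy sketch of the two quantities mentioned there: the listening-effort proxy as the gap between the causal-only and the causal-plus-context accuracy, and the band-pass manipulation obtained by keeping only a fraction of the input nodes. All numbers are made up for illustration.

    # Addendum sketch (NumPy) with made-up numbers, only to make the two ideas concrete.
    import numpy as np

    # Listening-effort proxy: the gap between the causal network alone and the causal
    # network followed by the 600 ms context network (hypothetical accuracies).
    acc_causal_only = 0.62
    acc_with_context = 0.71
    listening_effort = acc_with_context - acc_causal_only   # how much the "memory" had to fix

    # Band limiting by keeping only a fraction of the input nodes (spectral channels),
    # which has the same effect as a band-pass filter on the spectrogram.
    spectrogram = np.random.rand(300, 40)                    # (time frames, channels), toy data
    centre_freqs = np.geomspace(100.0, 8000.0, spectrogram.shape[1])
    keep = (centre_freqs >= 1000.0) & (centre_freqs <= 2000.0)   # e.g. keep only 1-2 kHz
    band_limited = spectrogram * keep                        # zero out all other channels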