Hello, my name is Anil, and in this video I will present an efficient DNN-based approximation of a state-of-the-art auditory model. This work has been a collaboration with Ian Bruce.

An auditory model simulates the transformation of an acoustic signal through the auditory periphery up to the auditory nerve. It is an important tool for understanding the mechanisms of normal and impaired hearing, which would not be easily possible with psychophysical experiments alone. In addition, auditory models can be used in speech and audio processing, for instance as a front-end for speech recognizers or as a pre-processor for hearing devices. However, an accurate simulation of the neural representation is typically computationally demanding, which renders real-time application infeasible. Therefore, several approaches have been proposed that approximate auditory models by deep neural networks. In this work, we present a WaveNet-based approximation scheme for the cochlear processing and hair cell transduction stages of a widely used auditory model.

Here we consider the auditory model by Zilany and Bruce. The model emulates the auditory periphery by a static middle-ear filter and a non-linear, time-varying cochlear filter bank. The resulting inner hair cell receptor potential can be further processed by a stochastic synapse model and a spike generator, which simulate action potentials in auditory nerve fibers. In this work, we discard the synapse model and the spike generator and only consider the inner hair cell receptor potential for normal-hearing listeners. An example of the modeled inner hair cell receptor potential can be seen on the right-hand side for characteristic frequencies of 500 Hz and 4 kHz. While this model provides a good simulation of the auditory periphery, it has to be operated at a sampling rate of at least 100 kHz to avoid aliasing effects caused by the model non-linearities.
This comes along with an increased computational complexity, which prevents the application of the model, for instance, in real-time scenarios.

For the DNN-based approximation of the auditory model, we use the WaveNet, which is a fully convolutional network topology. It is composed of T stacks, each of which has L layers, and it uses residual and skip connections to improve the stability of training. At the output, the network has two additional layers with adjustable activation functions. Originally, the WaveNet was developed for an autoregressive generation of speech samples. In this work, however, we slightly modified the network such that it predicts the inner hair cell potential for C characteristic frequencies simultaneously and in a feedforward fashion. In this figure, we see an example configuration using four layers and one stack. It takes an audio signal at the input and computes the inner hair cell potential for all considered characteristic frequencies at the output.

Training and validation of the WaveNet model were performed with clean speech, noisy speech, and music using the MUSAN database. For testing, we used various datasets which were not used for training and validation. We considered a wide range of signal-to-noise ratios and sound pressure levels. The inner hair cell representation was generated for the whole database at a sampling rate of 100 kHz using the original auditory model. Both the input signals and the inner hair cell representations were then resampled to 16 kHz to be used for training and testing the WaveNet. In total, we considered 80 characteristic frequencies. As an optimization criterion, we used either the mean square error or the L1 norm of the error, both in combination with either a tanh or a PReLU activation function.
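The stack-and-layer structure with residual and skip connections can be sketched as follows. This is a minimal single-channel illustration in plain Python, not the actual model: the kernels, the simplified tanh gate, and the doubling dilation pattern are assumptions made for clarity, and the real network operates on many channels with learned weights.

```python
import math

def dilated_causal_conv(x, w, dilation):
    """Causal 1-D convolution with left zero-padding, so output[t] depends
    only on x[t], x[t - dilation], x[t - 2*dilation], ..."""
    pad = (len(w) - 1) * dilation
    xp = [0.0] * pad + list(x)
    return [sum(w[i] * xp[t + pad - i * dilation] for i in range(len(w)))
            for t in range(len(x))]

def wavenet_stack(x, kernels):
    """One WaveNet-style stack: len(kernels) dilated layers.  The dilation
    doubles per layer (1, 2, 4, ...), so the receptive field grows
    exponentially with depth."""
    skip = [0.0] * len(x)   # skip connections are summed into the output path
    h = list(x)
    for l, w in enumerate(kernels):
        z = [math.tanh(v) for v in dilated_causal_conv(h, w, 2 ** l)]  # gate simplified to tanh
        skip = [s + zi for s, zi in zip(skip, z)]  # skip connection to the output
        h = [hi + zi for hi, zi in zip(h, z)]      # residual connection to the next layer
    return h, skip
```

Because every convolution is causal, the prediction at time t never looks ahead in the signal, which is what allows segment-wise, low-delay processing later on.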
To achieve better approximations, in particular at lower sound pressure levels, we applied a symmetric logarithmic compression with an adjustable compression factor D to the inner hair cell potential before training the WaveNet. After decompression of both the original and the predicted model outputs, the approximation accuracy was measured in terms of a signal-to-distortion ratio (SDR) for each characteristic frequency, which was then averaged across all frequencies.

Here we see the approximation accuracy for the inner hair cell representation in terms of the SDR measure as a function of the input sound pressure level. Of all considered combinations of loss and activation functions, the L1 loss in conjunction with a PReLU activation function and a compression factor of 100 turned out to perform best in terms of SDR. Throughout all sound pressure levels, this setting leads to a better approximation than the uncompressed case.

Next, we see time-frequency representations of the inner hair cell potential for a test music signal at different sound pressure levels. We call these representations IHCograms, in analogy to spectrograms. At all sound pressure levels, the WaveNet-based approximation yields highly accurate predictions of the original auditory model output. It should be noted that this is possible at a sampling rate of only 16 kHz, whereas the original model has to be operated at a sampling rate of at least 100 kHz. This reduces the computational complexity considerably, as we will see later.

We also tested the trained WaveNet model with pure-tone signals which were not part of the training data. These diagrams depict the amplitudes of the DC and AC components in the resulting inner hair cell potentials for different tone frequencies and at different sound pressure levels, both for the original model and the WaveNet-based approximation.
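The compression, its inverse, and the per-frequency SDR evaluation can be sketched as follows. The exact compression formula is not stated in the narration, so the log1p form below is an assumed example of a symmetric logarithmic compression; only its invertibility and the role of the factor D matter for the illustration.

```python
import math

def compress(v, D=100.0):
    """Symmetric logarithmic compression of one IHC potential sample.
    Small values are boosted relative to large ones, symmetrically in sign."""
    return math.copysign(math.log1p(D * abs(v)) / math.log1p(D), v)

def decompress(y, D=100.0):
    """Exact inverse of compress, applied before the SDR is computed."""
    return math.copysign(math.expm1(abs(y) * math.log1p(D)) / D, y)

def sdr_db(ref, est):
    """Signal-to-distortion ratio in dB for one characteristic frequency;
    the reported accuracy averages this across all frequencies."""
    err = sum((r - e) ** 2 for r, e in zip(ref, est))
    return 10.0 * math.log10(sum(r ** 2 for r in ref) / err)
```

Training on the compressed potential weights small-amplitude (low sound pressure level) structure more heavily in the loss, which is why the compressed setting performs better at low input levels.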
The approximation model closely emulates the behavior of the original model regarding the contributions of the DC and AC components for pure-tone input signals.

This figure shows the processing times for a 10-second input signal as a function of the segment length, which corresponds to the algorithmic delay of the system. Executing the reference MATLAB and C code of the original auditory model in parallel on four CPUs requires around 50 seconds on a standard PC, irrespective of the segment length. In contrast, the WaveNet model requires only 11 seconds of processing time for the same signal if it is split into segments of 750 ms. This corresponds to a reduction of the processing time by a factor of about 5 using only a single CPU. By using a GPU in a standard PC, the processing time can be reduced to around one second. When the model is executed on a GPU server using a GPU with a larger VRAM, the given signal is processed in 0.2 seconds, which corresponds to a reduction of the processing time by a factor of 250.

To summarize, we proposed a WaveNet-based approximation of the cochlear processing and hair cell transduction stages of a widely used auditory model. The WaveNet model is computationally efficient, provides accurate predictions of the inner hair cell potential, and is fully differentiable. This opens perspectives for training DNN-based audio processing schemes using the WaveNet auditory model as part of the optimization criterion. This work has also shown that the approximation becomes more accurate, especially at lower sound pressure levels, if compression is applied to the inner hair cell representation. This allows an improved training with a wider range of sound pressure levels, making the model more robust against variations in input levels. In future work, we plan to also integrate hearing impairment and neural transduction mechanisms into the WaveNet model and to develop DNN-based hearing loss compensation schemes.
Finally, I'd like to thank you for your attention.