Hi, everyone. Thanks for being here. I'm very happy that I can present our research here at VCCA. My name is Bernd Meyer from the Communication Acoustics Group in Oldenburg, and I'm presenting results from a collaboration between Jana Roßbach, Saskia Röttges, Christopher Hauth, Thomas Brand, and myself. We are very much interested in models that predict human perception, specifically speech intelligibility, and in the following I'll introduce a blind model for binaural prediction based on automatic phoneme recognition using deep neural networks.

Let me start with a short evolution of the models that we looked at in the Medical Physics Group and in the Communication Acoustics Group. The first one is BSIM 2006, BSIM for Binaural Speech Intelligibility Model, proposed by Beutelmann and Brand, which uses an intrusive front end and an intrusive back end. A more recent development is BSIM 2020, which replaces the front end with a non-intrusive version. And the model that I would like to cover in this talk is called BAPSI, which combines a non-intrusive front end and a non-intrusive back end. BAPSI stands for Binaural ASR-based Prediction of Speech Intelligibility; I haven't introduced ASR yet, that's automatic speech recognition. So we are actually using parts of a speech recognizer.

Our motivation is that previous models are able to predict the reference data, but they receive separated speech and noise. As we want to move out of the lab and into real-life applications where this separation is not available, we think that blind models are desirable, and we focus on this aspect. The metric that we use for evaluation is the speech recognition threshold, the SRT, which is the signal-to-noise ratio at which 50% of the words are recognized. And we compare our results to two baseline models.

On this slide, I'm introducing the data that we're using for evaluating BAPSI. This data comes from the study by Beutelmann and Brand in 2006. The stimuli were used for SRT measurements with the Oldenburg Sentence Test, OLSA. We used speech and noise signals, which were convolved with measured head-related room impulse responses to create binaural scenes. In these binaural scenes, the target speaker was always placed at zero degrees, in front of the listener, and the noise was placed at various angles in the scene. We explore three different rooms here: anechoic, office, and cafeteria. The noise signal itself is stationary and has the same long-term spectrum as the speech signals; it is presented at a fixed level of 65 dB SPL. Eight normal-hearing listeners participated in the measurement. Signals were presented via headphones in a listening booth, and the SRT was determined through an adaptive procedure that varies the signal-to-noise ratio so that we end up at the 50% word recognition rate.

On this slide, I would like to introduce the baseline models. The first one is the BSIM 2006 that I already mentioned. It receives the clean speech and the noise as input and then uses an intrusive equalization and cancellation, or EC, stage to exploit the binaural information that is provided. The output here is the predicted speech intelligibility, which is estimated using the SII as a back end. The second baseline is a modified version of HASPI, the Hearing Aid Speech Perception Index, which was extended with a very simple better-ear listening. It receives the clean speech and the speech mixed with noise as input. The model combines the coherence between clean and noisy speech with the correlation of their envelopes. To implement better-ear listening, we just compute a speech intelligibility prediction for each ear and then use the one that produces the higher SI; this results in the final SI for that baseline model.
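To make this better-ear extension concrete, a minimal sketch of the selection step could look like the following, with `predict_si` standing in as a placeholder for the monaural predictor (HASPI in this case). This illustrates the idea only; it is not the exact implementation used in the study.

```python
def better_ear_si(clean_lr, noisy_lr, fs, predict_si):
    """Better-ear extension of a monaural intelligibility predictor.

    clean_lr, noisy_lr: pairs (left, right) of clean and noisy ear signals.
    predict_si: callable (clean, noisy, fs) -> scalar SI estimate; here a
        placeholder for a HASPI-like predictor, not a specific library call.
    Returns the higher of the two per-ear SI predictions.
    """
    si_per_ear = [predict_si(c, n, fs) for c, n in zip(clean_lr, noisy_lr)]
    return max(si_per_ear)
```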
On this slide, I'm showing the first component of the proposed model, BAPSI. In this front end, the mixed signal of speech and noise is used as input to a gammatone filter bank, which is shown here. The output of the gammatone filter bank then passes to two different processing stages: better-ear processing for frequencies from 1,500 to 8,500 Hertz, and an EC stage for lower frequencies. The EC stage performs a level normalization and phase correction, and then the signals are either subtracted or added up, which requires a decision about which of the two signals to use. For the better-ear component, we also need to decide which side to use. These decisions are made non-intrusively using the speech-to-reverberation modulation energy ratio, SRMR, and the thinking behind this is that the resulting signal should be as close to speech as possible. We then put everything back together in a resynthesis step here.

And then we come to the back end; we're actually not using the SII in this work. Here's an overview of the back end that we're using, which builds on standard components of an automatic speech recognizer. We take the mono, single-channel signal that I've just shown, which is converted to features, mel spectrogram features. These are used as input to a deep neural network that estimates which phoneme was produced in each time frame that we're observing. Just to illustrate what is happening here, you see phoneme probabilities over time, where black means a higher probability, for a clean signal and for the same signal in minus 10 dB noise. If you compare the red and the green probability vectors, you can see that in the left case the probability vectors are very different: the activations are clearly in different entries of the vector. On the right side, the noise smears everything out, so the red and the green vectors become quite similar; distant frames become more similar. And this is what we capture with the mean temporal distance, which is an entropy-based distance measure proposed by Hynek Hermansky.

With this, we can estimate the SRT. What we do is calculate the mean temporal distance for 20 OLSA sentences at different signal-to-noise ratios. These values are arbitrary, so we need to link them to actual speech intelligibility scores. We therefore compute a reference MTD with 100 OLSA sentences that are not used during the actual tests, and we then select the SNR that results in the reference MTD. That is what we use as the SRT.
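To make the back end a bit more tangible, here is a minimal sketch of how a mean temporal distance could be computed from a phoneme posteriorgram and then turned into an SRT estimate by matching the reference MTD. The choice of the symmetric KL divergence as frame distance and of linear interpolation over SNR are assumptions for this illustration, not necessarily the exact setup of the paper.

```python
import numpy as np

def mean_temporal_distance(posteriors, lags):
    """Mean temporal distance (M-measure) over a phoneme posteriorgram.

    posteriors: array of shape (T, n_phonemes); each row is a posterior
        probability vector from the phoneme classifier.
    lags: positive frame offsets over which pairwise distances are averaged.
    The frame distance used here is the symmetric KL divergence, one possible
    choice for an entropy-based distance.
    """
    eps = 1e-12
    p = np.clip(np.asarray(posteriors, dtype=float), eps, 1.0)
    dists = []
    for lag in lags:
        a, b = p[:-lag], p[lag:]  # pairs of frames that are `lag` steps apart
        skl = np.sum(a * np.log(a / b) + b * np.log(b / a), axis=1)
        dists.append(skl.mean())
    return float(np.mean(dists))

def estimate_srt(snrs, mtd_per_snr, mtd_ref):
    """Select the SNR whose MTD matches the reference MTD.

    snrs: SNRs (dB) at which the test sentences were processed.
    mtd_per_snr: MTD values measured at those SNRs (MTD grows with SNR,
        since posteriors of distant frames stay dissimilar in clean speech).
    mtd_ref: reference MTD computed once from held-out sentences.
    Linear interpolation between the measured points.
    """
    order = np.argsort(mtd_per_snr)  # np.interp needs increasing x values
    return float(np.interp(mtd_ref,
                           np.asarray(mtd_per_snr)[order],
                           np.asarray(snrs)[order]))
```

For instance, one would run the phoneme classifier on the 20 test sentences at each SNR, average the MTD per SNR, and then call `estimate_srt` with the reference MTD obtained from the 100 held-out sentences.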
So on this slide, I'm showing the results of our study. In each plot, you see the SRT over the location of the noise, that is, the angle of the noise source. A very difficult condition is when noise and speech are co-located at 0 degrees, where we see that the SRTs are elevated. The green line here shows the subjective data from the human listeners, and the line with the diamond markers is our model, which is a bit off in this case: it hovers between HASPI plus better ear, which is the graph with the asterisks, and BSIM 2006, which is the other line here. For the other conditions, if we check the anechoic room, it's a bit harder to see the lines because our model quite nicely predicts the subjective responses, and the same is also true for the office condition. In anechoic and office, we see correlations between subjective data and our model predictions of 0.98, and for the cafeteria it is still 0.74. If you're interested in the details of the results, or in more details about, for instance, the training of the DNN, then please check out our recent ICASSP paper, which I've referenced here and which is freely available on the web.

Of course, there's lots of future work to do. We need to evaluate the model for a broader class of acoustic conditions; just two things that come to mind are other speech material, since so far we have only used matrix sentences, and dynamic acoustic scenes. We could also directly predict speech intelligibility instead of estimating the SRT. And then, of course, it would be very interesting to investigate the predictions for hearing-impaired listeners.

With this, I'm at the end of my presentation. Thanks for your attention, and I hope to see you soon in real life. Bye-bye.