Topics in Audio Information Processing Research

Assistant: Dipl.-Ing. Matthieu Kuntz and the AIP team
Offered: Winter and summer semester
Target Group: Scientific seminar as a subject-specific supplement (MSEI, MSNE)
Doctoral seminar
Schedule: 2 SWS
Exam: oral
Time & Place:

Does not take place in WS 2020/21; the next regular course is in summer 2021.

Thursday, 09:45 - 11:15 hours, N6507

Dates: Start on 23.04.2020 (topic selection), no seminar on 14.05, 21.05 and 11.06


The seminar is targeted at advanced students, PhD candidates and post-docs in the field of audio-information processing. Scientific publications on current topics in audio-information processing are presented in a small group and discussed in depth ("journal club"). Each participant will present at least one publication (usually two) and lead the discussion. To prepare for the discussion, each participant will read the material prior to each seminar meeting. The focus of the seminar is on understanding and discussing the content. Participants get to know current topics in audio-information processing, train their comprehension of English-language scientific publications and practice scientific discourse as well as leading a discussion.

Previous knowledge expected: Lecture Psychoacoustics and Audiological Applications

Every student is responsible for proper registration for the exam in TUMOnline.

Seminar topics for SS 2020

The impact of peripheral mechanisms on the precedence effect

Advisor: Norbert Kolotzek, M.Sc.
Publication: Pastore, M. T., & Braasch, J. (2019). The impact of peripheral mechanisms on the precedence effect. The Journal of the Acoustical Society of America, 146 (1), 425–444.
Abstract: When two similar sounds are presented from different locations, with one (the lead) preceding the other (the lag) by a small delay, listeners typically report hearing one sound near the location of the lead sound source—this is called the precedence effect (PE). Several questions about the underlying mechanisms that produce the PE are asked. (1) How might listeners’ relative weighting of cues at onset versus ongoing stimulus portions affect perceived lateral position of long-duration lead/lag noise stimuli? (2) What are the factors that influence this weighting? (3) Are the mechanisms invoked to explain the PE for transient stimuli applicable to long-duration stimuli? To answer these questions, lead/lag noise stimuli are presented with a range of durations, onset slopes, and lag-to-lead level ratios over headphones. Monaural, peripheral mechanisms, and binaural cue extraction are modeled to estimate the cues available for determination of perceived laterality. Results showed that all three stimulus manipulations affect the relative weighting of onset and ongoing cues and that mechanisms invoked to explain the PE for transient stimuli are also applicable to the PE, in terms of both onset and ongoing segments of long-duration, lead/lag stimuli.

Auditory motion perception emerges from successive sound localizations integrated over time

Advisor: Norbert Kolotzek, M.Sc.
Publication: Roggerone, V., Vacher, J., Tarlao, C., & Guastavino, C. (2019). Auditory motion perception emerges from successive sound localizations integrated over time. Scientific Reports, 9 (16437).
Abstract: Humans rely on auditory information to estimate the path of moving sound sources. But unlike in vision, the existence of motion-sensitive mechanisms in audition is still open to debate. Psychophysical studies indicate that auditory motion perception emerges from successive localization, but existing models fail to predict experimental results. However, these models do not account for any temporal integration. We propose a new model tracking motion using successive localization snapshots but integrated over time. This model is derived from psychophysical experiments on the upper limit for circular auditory motion perception (UL), defined as the speed above which humans no longer identify the direction of sounds spinning around them. Our model predicts ULs measured with different stimuli using solely static localization cues. The temporal integration blurs these localization cues rendering them unreliable at high speeds, which results in the UL. Our findings indicate that auditory motion perception does not require motion-sensitive mechanisms.

Neural coding and perception of auditory motion direction based on interaural time differences

Advisor: Norbert Kolotzek, M.Sc.
Publication: Zuk, N. J., & Delgutte, B. (2019). Neural coding and perception of auditory motion direction based on interaural time differences. Journal of Neurophysiology, 122, 1821–1842.

Abstract: While motion is important for parsing a complex auditory scene into perceptual objects, how it is encoded in the auditory system is unclear. Perceptual studies suggest that the ability to identify the direction of motion is limited by the duration of the moving sound, yet we can detect changes in interaural differences at even shorter durations. To understand the source of these distinct temporal limits, we recorded from single units in the inferior colliculus (IC) of unanesthetized rabbits in response to noise stimuli containing a brief segment with linearly time-varying interaural time difference ("ITD sweep") temporally embedded in interaurally uncorrelated noise. We also tested the ability of human listeners to either detect the ITD sweeps or identify the motion direction. Using a point-process model to separate the contributions of stimulus dependence and spiking history to single-neuron responses, we found that the neurons respond primarily by following the instantaneous ITD rather than exhibiting true direction selectivity. Furthermore, using an optimal classifier to decode the single-neuron responses, we found that neural threshold durations of ITD sweeps for both direction identification and detection overlapped with human threshold durations even though the average response of the neurons could track the instantaneous ITD beyond psychophysical limits. Our results suggest that the IC does not explicitly encode motion direction, but internal neural noise may limit the speed at which we can identify the direction of motion.

Prediction of individual speech recognition performance in complex listening conditions

Advisor: Ľuboš Hládek, PhD
Publication: Kubiak, A. M., Rennies, J., Ewert, S. D., & Kollmeier, B. (2020). Prediction of individual speech recognition performance in complex listening conditions. The Journal of the Acoustical Society of America, 147 (3), 1379–139.
Abstract: This study examined how well individual speech recognition thresholds in complex listening scenarios could be predicted by a current binaural speech intelligibility model. Model predictions were compared with experimental data measured for seven normal-hearing and 23 hearing-impaired listeners who differed widely in their degree of hearing loss, age, as well as performance in clinical speech tests. The experimental conditions included two masker types (multi-talker or two-talker maskers), and two spatial conditions (maskers co-located with the frontal target or symmetrically separated from the target). The results showed that interindividual variability could not be well predicted by a model including only individual audiograms. Predictions improved when an additional individual “proficiency factor” was derived from one of the experimental conditions or a standard speech test. Overall, the current model can predict individual performance relatively well (except in conditions high in informational masking), but the inclusion of age-related factors may lead to even further improvements.

The effect of spatial energy spread on sound image size and speech intelligibility

Advisor: Ľuboš Hládek, PhD

Publication: Ahrens, A., Marschall, M., & Dau, T. (2020). The effect of spatial energy spread on sound image size and speech intelligibility. The Journal of the Acoustical Society of America, 147 (3), 1368–1378.

Abstract: This study explored the relationship between perceived sound image size and speech intelligibility for sound sources reproduced over loudspeakers. Sources with varying degrees of spatial energy spread were generated using ambisonics processing. Young normal-hearing listeners estimated sound image size as well as performed two spatial release from masking (SRM) tasks with two symmetrically arranged interfering talkers. Either the target-to-masker ratio or the separation angle was varied adaptively. Results showed that the sound image size did not change systematically with the energy spread. However, a larger energy spread did result in a decreased SRM. Furthermore, the listeners needed a greater angular separation angle between the target and the interfering sources for sources with a larger energy spread. Further analysis revealed that the method employed to vary the energy spread did not lead to systematic changes in the interaural cross correlations. Future experiments with competing talkers using ambisonics or similar methods may consider the resulting energy spread in relation to the minimum separation angle between sound sources in order to avoid degradations in speech intelligibility.

Audio-Visual Speech Intelligibility Benefits with Bilateral Cochlear Implants when Talker Location Varies

Advisor: Ľuboš Hládek, PhD

Publication: van Hoesel, R. J. (2015). Audio-Visual Speech Intelligibility Benefits with Bilateral Cochlear Implants when Talker Location Varies. Journal of the Association for Research in Otolaryngology, 16 (2), 309–315.

Abstract: One of the key benefits of using cochlear implants (CIs) in both ears rather than just one is improved localization. It is likely that in complex listening scenes, improved localization allows bilateral CI users to orient toward talkers to improve signal-to-noise ratios and gain access to visual cues, but to date, that conjecture has not been tested. To obtain an objective measure of that benefit, seven bilateral CI users were assessed for both auditory-only and audio-visual speech intelligibility in noise using a novel dynamic spatial audio-visual test paradigm. For each trial conducted in spatially distributed noise, first, an auditory-only cueing phrase that was spoken by one of four talkers was selected and presented from one of four locations. Shortly afterward, a target sentence was presented that was either audio-visual or, in another test configuration, audio-only and was spoken by the same talker and from the same location as the cueing phrase. During the target presentation, visual distractors were added at other spatial locations. Results showed that in terms of speech reception thresholds (SRTs), the average improvement for bilateral listening over the better performing ear alone was 9 dB for the audio-visual mode, and 3 dB for audition-alone. Comparison of bilateral performance for audio-visual and audition-alone showed that inclusion of visual cues led to an average SRT improvement of 5 dB. For unilateral device use, no such benefit arose, presumably due to the greatly reduced ability to localize the target talker to acquire visual information. The bilateral CI speech intelligibility advantage over the better ear in the present study is much larger than that previously reported for static talker locations and indicates greater everyday speech benefits and improved cost-benefit than estimated to date.

Spatial release from masking under different reverberant conditions in young and elderly subjects

Advisor: Ľuboš Hládek, PhD

Publication: Muñoz, R. V., Aspöck, L., & Fels, J. (2019). Spatial release from masking under different reverberant conditions in young and elderly subjects. Journal of Speech, Language, and Hearing Research, 62 (9), 3582–3595.

Abstract: Purpose: Normal-hearing and hard-of-hearing listeners suffer from reduced speech intelligibility in noisy and reverberant environments. Although daily listening environments are in constant motion, most researchers have only studied speech-in-noise perception for stationary masker locations. The aim of this study was to investigate the spatial release from masking (SRM) of circularly and radially moving maskers under different room acoustic conditions for young and elderly subjects. Method: Twelve young subjects with normal hearing and 12 elderly subjects with normal hearing or mild hearing loss were tested. Several different room acoustic conditions were simulated and reproduced via headphones using binaural synthesis. The target speech stream consisted of German digit triplets, and masker stream consisted of quasistationary noise with matched long-term averaged speech spectra. During the experiment, the position of the masker was changed to be in different stationary positions, or varied continuously. In the latter case, it was moved either on a circular trajectory spanning a 90° azimuth angle or on a radial trajectory linearly increasing the distance to the receiver from 0.5 m to 1.8 m. Absorption characteristics of the virtual room's surfaces were changed, recreating an anechoic room, a treated room with mean reverberation times (RT60) = 0.48 s, and an untreated room with mean RT60 = 1.26 s. Results: For the circular condition, a significant difference was found between moving and stationary maskers, F(4, 44) = 20.91, p < .001, with a bigger SRM for stationary maskers than moving masker conditions. Also, both age groups displayed a significant decrease in SRM over the reverberation conditions: F(2, 22) = 12.24, p < .001. For the radial condition, both age groups showed a significant decrease in SRM over the reverberation conditions, F(2, 22) = 13.62, p < .001, as well as the moving and stationary masker conditions, F(8, 88) = 29.23, p < .001. In general, the SRM of a moving masker decreased when the reverberation increased, especially for elderly subjects. Conclusions: A radially moving masker led to improved SRM in an anechoic environment for both age groups, whereas a circularly moving masker caused degraded SRM, especially for elderly subjects in the highly reverberant environment.

Combining the remote microphone technique with head-tracking for local active sound control

Advisor: Ramona Beinstingel, M.Sc.
Publication: Jung, W., Elliott, S. J., & Cheer, J. (2017). Combining the remote microphone technique with head-tracking for local active sound control. The Journal of the Acoustical Society of America, 142 (1), 298–307.
Abstract: This paper describes practical integration of the remote microphone technique with a head-tracking device in a local active noise control system. The formulation is first reviewed for the optimized observation filter and nearfield pressure estimation. The attenuation performance and stability of an adaptive active headrest system combined with the remote microphone technique are then studied. The accuracy of the nearfield estimation and the effect of the head-tracking on the control performance are investigated in real-time experiments. The regularization factor of the observation filter is selected as a trade-off between its accuracy and its robustness. The integrated active headrest system is used to estimate and attenuate disturbance signals at a listener's ears from a single tonal primary source, while a commercial head-tracking device detects and provides the real-time head position to the active headrest system whose responses are updated accordingly.

Externalization of remote microphone signals using a structural binaural model of the head and pinna

Advisor: Ramona Beinstingel, M.Sc.
Publication: Kates, J. M., Arehart, K. H., Muralimanohar, R. K., & Sommerfeldt, K. (2018). Externalization of remote microphone signals using a structural binaural model of the head and pinna. The Journal of the Acoustical Society of America, 143 (5), 2666–2677.
Abstract: In a remote microphone (RM) system, a talker speaks into a microphone and the signal is transmitted to the hearing aids worn by the hearing-impaired listener. A difficulty with remote microphones, however, is that the signal received at the hearing aid bypasses the head and pinna, so the acoustic cues needed to externalize the sound source are missing. The objective of this paper is to process the RM signal to improve externalization when listening through earphones. The processing is based on a structural binaural model, which uses a cascade of processing modules to simulate the interaural level difference, interaural time difference, pinna reflections, ear-canal resonance, and early room reflections. The externalization results for the structural binaural model are compared to a left-right signal blend, the listener's own anechoic head-related impulse response (HRIR), and the listener's own HRIR with room reverberation. The azimuth is varied from straight ahead to 90° to one side. The results show that the structural binaural model is as effective as the listener's own HRIR plus reverberation in producing an externalized acoustic image, and that there is no significant difference in externalization between hearing-impaired and normal-hearing listeners.

Personalized HRTF modeling based on deep neural network using anthropometric measurements and images of the ear

Advisor: Ramona Beinstingel, M.Sc.

Publication: Lee, G. W., & Kim, H. K. (2018). Personalized HRTF modeling based on deep neural network using anthropometric measurements and images of the ear. Applied Sciences (Switzerland), 8 (11).

Abstract: This paper proposes a personalized head-related transfer function (HRTF) estimation method based on deep neural networks by using anthropometric measurements and ear images. The proposed method consists of three sub-networks for representing personalized features and estimating the HRTF. As input features for neural networks, the anthropometric measurements regarding the head and torso are used for a feedforward deep neural network (DNN), and the ear images are used for a convolutional neural network (CNN). After that, the outputs of these two sub-networks are merged into another DNN for estimation of the personalized HRTF. To evaluate the performance of the proposed method, objective and subjective evaluations are conducted. For the objective evaluation, the root mean square error (RMSE) and the log spectral distance (LSD) between the reference HRTF and the estimated one are measured. Consequently, the proposed method provides the RMSE of -18.40 dB and LSD of 4.47 dB, which are lower by 0.02 dB and higher by 0.85 dB than the DNN-based method using anthropometric data without pinna measurements, respectively. Next, a sound localization test is performed for the subjective evaluation. As a result, it is shown that the proposed method can localize sound sources with higher accuracy of around 11% and 6% than the average HRTF method and DNN-based method, respectively. In addition, the reductions of the front/back confusion rate by 12.5% and 2.5% are achieved by the proposed method, compared to the average HRTF method and DNN-based method, respectively.
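
For orientation, the log spectral distance (LSD) named in this abstract is commonly defined as the root mean square of the dB difference between two magnitude spectra. The following is a minimal sketch of that common definition, not the evaluation code of the paper; the function name and the toy spectra are assumptions made for illustration.

```python
import math

def log_spectral_distance_db(h_ref, h_est):
    """RMS of the dB magnitude difference between two spectra.

    h_ref, h_est: sequences of (possibly complex) frequency-bin values,
    same length, all non-zero. Hypothetical helper for illustration only.
    """
    if len(h_ref) != len(h_est) or len(h_ref) == 0:
        raise ValueError("spectra must be non-empty and of equal length")
    squared_db_errors = [
        (20.0 * math.log10(abs(ref) / abs(est))) ** 2
        for ref, est in zip(h_ref, h_est)
    ]
    return math.sqrt(sum(squared_db_errors) / len(squared_db_errors))

# Toy example: an estimate whose magnitude is 10 times the reference
# in every bin deviates by 20 dB everywhere.
reference = [1.0, 0.5, 0.25]
estimate = [10.0, 5.0, 2.5]
print(log_spectral_distance_db(reference, estimate))
```

A lower LSD means the estimated magnitude response is closer to the reference; phase differences are ignored by this measure.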

Binaural direct-to-reverberant energy ratio and speaker distance estimation

Advisor: Dipl.-Ing. Matthieu Kuntz

Publication: Zohourian, M., & Martin, R. (2020). Binaural direct-to-reverberant energy ratio and speaker distance estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 92–104.

Abstract: This article addresses the problem of distance estimation using binaural hearing aid microphones in reverberant rooms. Among several distance indicators, the direct-to-reverberant energy ratio (DRR) has been shown to be more effective than other features. Therefore, we present two novel approaches to estimate the DRR of binaural signals. The first method is based on the interaural magnitude-squared coherence whereas the second approach uses stochastic maximum likelihood beamforming to estimate the power of the direct and reverberant components. The proposed DRR estimation algorithms are integrated into a distance estimation technique. When based solely on DRR, the distance estimation algorithm requires calibration where naturally the critical distance is a good calibration point. We thus propose two approaches for the calibration of the distance estimation algorithm: Informed calibration using the critical distance of the reverberant room and blind calibration using the listener's own voice. Results across various acoustical environments show the benefit of the proposed algorithms for the estimation of sound source distances up to 3 m with an estimation error of about 35 cm using informed calibration and about 1 m using the fully blind calibration strategy.
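
As a rough illustration of the quantities in this abstract, the sketch below computes the DRR from a known room impulse response and maps it to a distance. It is not the authors' method, which estimates the DRR blindly from binaural microphone signals. The mapping assumes the classical diffuse-field model, in which the DRR is 0 dB at the critical distance and drops by about 6 dB per doubling of distance. The function names, the 2.5 ms direct-path window and the toy impulse response are assumptions made for illustration.

```python
import math

def drr_db(rir, fs, direct_window_ms=2.5):
    """DRR in dB from a room impulse response (assumed to be known).

    Energy within +/- direct_window_ms around the strongest peak counts
    as direct sound, everything else as reverberant energy.
    """
    peak = max(range(len(rir)), key=lambda n: abs(rir[n]))
    half = round(direct_window_ms * 1e-3 * fs)
    start = max(peak - half, 0)
    stop = min(peak + half + 1, len(rir))
    direct = sum(x * x for x in rir[start:stop])
    reverberant = sum(x * x for x in rir[:start]) + sum(x * x for x in rir[stop:])
    if reverberant == 0.0:
        raise ValueError("no reverberant energy outside the direct window")
    return 10.0 * math.log10(direct / reverberant)

def distance_from_drr(drr, critical_distance_m):
    """Distance in metres, assuming DRR(d) = 20 * log10(d_c / d)."""
    return critical_distance_m * 10.0 ** (-drr / 20.0)

# Toy impulse response: a direct impulse plus one late reflection of
# equal energy, so direct and reverberant energy are the same.
fs = 16000
rir = [0.0] * 1000
rir[100] = 1.0
rir[500] = 1.0
estimated_drr = drr_db(rir, fs)
print(estimated_drr, distance_from_drr(estimated_drr, critical_distance_m=1.2))
```

Under this model the critical distance acts as the calibration point: a source at that distance yields a DRR of 0 dB, which corresponds to the informed calibration mentioned in the abstract.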

Cortical auditory distance representation based on direct-to-reverberant energy ratio

Advisor: Han Li, B.E.

Publication: Kopco, N., Doreswamy, K. K., Huang, S., Rossi, S., & Ahveninen, J. (2020). Cortical auditory distance representation based on direct-to-reverberant energy ratio. NeuroImage, 208, 116436.

Abstract: Auditory distance perception and its neuronal mechanisms are poorly understood, mainly because 1) it is difficult to separate distance processing from intensity processing, 2) multiple intensity-independent distance cues are often available, and 3) the cues are combined in a context-dependent way. A recent fMRI study identified human auditory cortical area representing intensity-independent distance for sources presented along the interaural axis (Kopco et al. PNAS, 109, 11019-11024). For these sources, two intensity-independent cues are available, interaural level difference (ILD) and direct-to-reverberant energy ratio (DRR). Thus, the observed activations may have been contributed by not only distance-related, but also direction-encoding neuron populations sensitive to ILD. Here, the paradigm from the previous study was used to examine DRR-based distance representation for sounds originating in front of the listener, where ILD is not available. In a virtual environment, we performed behavioral and fMRI experiments, combined with computational analyses to identify the neural representation of distance based on DRR. The stimuli varied in distance (15-100 cm) while their received intensity was varied randomly and independently of distance. Behavioral performance showed that intensity-independent distance discrimination is accurate for frontal stimuli, even though it is worse than for lateral stimuli. fMRI activations for sounds varying in frontal distance, as compared to varying only in intensity, increased bilaterally in the posterior banks of Heschl's gyri, the planum temporale, and posterior superior temporal gyrus regions. Taken together, these results suggest that posterior human auditory cortex areas contain neuron populations that are sensitive to distance independent of intensity and of binaural cues relevant for directional hearing.

Informational masking of negative masking

Advisor: Dipl.-Ing. Matthieu Kuntz

Publication: Conroy, C., Mason, C. R., & Kidd, G. (2020). Informational masking of negative masking. The Journal of the Acoustical Society of America, 147 (2), 798–811.

Abstract: Negative masking (NM) is a ubiquitous finding in near-“threshold” psychophysics in which the detectability of a near-threshold signal improves when added to a copy of itself, i.e., a pedestal or masker. One interpretation of NM suggests that the pedestal acts as an informative cue, thereby reducing uncertainty and improving performance relative to detection in its absence. The purpose of this study was to test this hypothesis. Intensity discrimination thresholds were measured for 100-ms, 1000-Hz near-threshold tones. In the reference condition, thresholds were measured in quiet (no masker other than the pedestal). In comparison conditions, thresholds were measured in the presence of one of two additional maskers: a notched-noise masker or a random-frequency multitone masker. The additional maskers were intended to cause different amounts of uncertainty and, in turn, to differentially influence NM. The results were generally consistent with an uncertainty-based interpretation of NM: NM was found both in quiet and in notched-noise, yet it was eliminated by the multitone masker. A competing interpretation of NM based on nonlinear transduction does not account for all of the results. Profile analysis may have been a factor in performance and this suggests that NM may be attributable to, or influenced by, multiple mechanisms.

Auditory figure-ground segregation is impaired by high visual load

Advisor: Han Li, B.E.

Publication: Molloy, K., Lavie, N., & Chait, M. (2019). Auditory figure-ground segregation is impaired by high visual load. Journal of Neuroscience, 39 (9), 1699–1708.

Abstract: Figure-ground segregation is fundamental to listening in complex acoustic environments. An ongoing debate pertains to whether segregation requires attention or is “automatic” and preattentive. In this magnetoencephalography study, we tested a prediction derived from load theory of attention (e.g., Lavie, 1995) that segregation requires attention but can benefit from the automatic allocation of any “leftover” capacity under low load. Complex auditory scenes were modeled with stochastic figure-ground stimuli (Teki et al., 2013), which occasionally contained repeated frequency component “figures.” Naive human participants (both sexes) passively listened to these signals while performing a visual attention task of either low or high load. While clear figure-related neural responses were observed under conditions of low load, high visual load substantially reduced the neural response to the figure in auditory cortex (planum temporale, Heschl's gyrus). We conclude that fundamental figure-ground segregation in hearing is not automatic but draws on resources that are shared across vision and audition.

Speaker-independent auditory attention decoding without access to clean speech sources

Advisor: Han Li, B.E.

Publication: Han, C., O’Sullivan, J., Luo, Y., Herrero, J., Mehta, A. D., & Mesgarani, N. (2019). Speaker-independent auditory attention decoding without access to clean speech sources. Science Advances, 5 (5), 1–12.

Abstract: Speech perception in crowded environments is challenging for hearing-impaired listeners. Assistive hearing devices cannot lower interfering speakers without knowing which speaker the listener is focusing on. One possible solution is auditory attention decoding in which the brainwaves of listeners are compared with sound sources to determine the attended source, which can then be amplified to facilitate hearing. In realistic situations, however, only mixed audio is available. We utilize a novel speech separation algorithm to automatically separate speakers in mixed audio, with no need for the speakers to have prior training. Our results show that auditory attention decoding with automatically separated speakers is as accurate and fast as using clean speech sounds. The proposed method significantly improves the subjective and objective quality of the attended speaker. Our study addresses a major obstacle in actualization of auditory attention decoding that can assist hearing-impaired listeners and reduce listening effort for normal-hearing subjects.


Registration for the seminar is done via TUMOnline.