Advanced seminar on audio information processing

Assistant: Norbert Kolotzek, M.Sc. and the AIP team
Offered: Winter and summer semester
Target Group: Elective module for technical supplementation (Wahlmodul zur fachlichen Ergänzung, Master EI)
Doctoral seminar
Schedule: 2 SWS (weekly hours per semester)
Exam: oral
Time & Place: Thursday, 09:45 - 11:15, N6507
Dates: Start on 17.10.2019 (topic selection)


The seminar is targeted at advanced students, PhD candidates and post-docs in the field of audio-information processing. Scientific publications on current topics in audio-information processing are presented in a small group and discussed in depth ("journal club"). Each participant presents at least one publication (usually two) and leads the discussion. To prepare for the discussion, each participant reads the material before each seminar meeting. The focus of the seminar is on understanding and discussing the content. Participants get to know current topics in audio-information processing, train their comprehension of English-language scientific publications, and practice scientific discourse as well as leading a discussion.

Previous knowledge expected: Lecture Psychoacoustics and Audiological Applications

Every student is responsible for proper registration for the exam in TUMOnline.

Seminar topics for WS 19/20

Real-time decoding of question-and-answer speech dialogue using human cortical activity

Advisor: Prof. Bernhard Seeber

Moses, D.A., Leonard, M.K., Makin, J.G., Chang, E.F. (2019). Real-time decoding of question-and-answer speech dialogue using human cortical activity. Nature Communications, 10:3096.


Abstract: Natural communication often occurs in dialogue, differentially engaging auditory and sensorimotor brain regions during listening and speaking. However, previous attempts to decode speech directly from the human brain typically consider listening or speaking tasks in isolation. Here, human participants listened to questions and responded aloud with answers while we used high-density electrocorticography (ECoG) recordings to detect when they heard or said an utterance and to then decode the utterance’s identity. Because certain answers were only plausible responses to certain questions, we could dynamically update the prior probabilities of each answer using the decoded question likelihoods as context. We decode produced and perceived utterances with accuracy rates as high as 61% and 76%, respectively (chance is 7% and 20%). Contextual integration of decoded question likelihoods significantly improves answer decoding. These results demonstrate real-time decoding of speech in an interactive, conversational setting, which has important implications for patients who are unable to communicate.
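
As preparation for the discussion, the idea of using decoded question likelihoods as context for answer decoding can be illustrated with a generic Bayesian update. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `integrate_context`, the dictionary layout, and all numbers are hypothetical and chosen purely for illustration.

```python
def integrate_context(question_probs, answer_given_question, answer_likelihoods):
    """Re-weight answer likelihoods with a prior derived from decoded questions.

    question_probs: question -> probability that this question was heard
    answer_given_question: question -> {answer -> P(answer | question)}
    answer_likelihoods: answer -> likelihood of the recorded activity given the answer
    All three inputs are hypothetical illustration data.
    """
    # Context prior: P(answer) = sum over questions of P(answer | q) * P(q)
    prior = {}
    for question, p_question in question_probs.items():
        for answer, p_answer in answer_given_question[question].items():
            prior[answer] = prior.get(answer, 0.0) + p_answer * p_question

    # Posterior is proportional to likelihood times context prior
    unnormalized = {
        answer: answer_likelihoods.get(answer, 0.0) * p_prior
        for answer, p_prior in prior.items()
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("no answer has non-zero posterior mass")
    return {answer: value / total for answer, value in unnormalized.items()}


# Hypothetical example: the decoder is fairly sure that question "q1" was asked.
question_probs = {"q1": 0.8, "q2": 0.2}
answer_given_question = {
    "q1": {"yes": 0.5, "no": 0.5},
    "q2": {"guitar": 1.0},
}
answer_likelihoods = {"yes": 0.3, "no": 0.3, "guitar": 0.4}

posterior = integrate_context(question_probs, answer_given_question, answer_likelihoods)
```

In this toy example, "guitar" has the highest likelihood on its own, but it is only a plausible reply to the unlikely question "q2", so its posterior probability falls below that of "yes" and "no" once the context is taken into account.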

Adaptation of human auditory cortex to changing background noise

Advisor: Norbert Kolotzek, M.Sc.
Publication: Khalighinejad, B., Herrero, J.L., Mehta, A.D., Mesgarani, N. (2019). Adaptation of human auditory cortex to changing background noise. Nature Communications, 10:2509.
Abstract: Speech communication in real-world environments requires adaptation to changing acoustic conditions. How the human auditory cortex adapts as a new noise source appears in or disappears from the acoustic scene remains unclear. Here, we directly measured neural activity in the auditory cortex of six human subjects as they listened to speech with abruptly changing background noises. We report rapid and selective suppression of acoustic features of noise in the neural responses. This suppression results in enhanced representation and perception of speech acoustic features. The degree of adaptation to different background noises varies across neural sites and is predictable from the tuning properties and speech specificity of the sites. Moreover, adaptation to background noise is unaffected by the attentional focus of the listener. The convergence of these neural and perceptual effects reveals the intrinsic dynamic mechanisms that enable a listener to filter out irrelevant sound sources in a changing acoustic scene.

Neurons in primary auditory cortex represent sound source location in a cue-invariant manner

Advisor: Norbert Kolotzek, M.Sc.
Publication: Wood, K.C., Town, S.M., Bizley, J.K. (2019). Neurons in primary auditory cortex represent sound source location in a cue-invariant manner. Nature Communications, 10:3019.
Abstract: Auditory cortex is required for sound localisation, but how neural firing in auditory cortex underlies our perception of sound sources in space remains unclear. Specifically, whether neurons in auditory cortex represent spatial cues or an integrated representation of auditory space across cues is not known. Here, we measured the spatial receptive fields of neurons in primary auditory cortex (A1) while ferrets performed a relative localisation task. Manipulating the availability of binaural and spectral localisation cues had little impact on ferrets’ performance, or on neural spatial tuning. A subpopulation of neurons encoded spatial position consistently across localisation cue type. Furthermore, neural firing pattern decoders outperformed two-channel model decoders using population activity. Together, these observations suggest that A1 encodes the location of sound sources, as opposed to spatial cue values.

Binaural unmasking with temporal envelope and fine structure in listeners with cochlear implants

Advisor: Norbert Kolotzek, M.Sc.
Publication: Todd, A.E., Goupell, M.J., Litovsky, R.Y. (2019). Binaural unmasking with temporal envelope and fine structure in listeners with cochlear implants. The Journal of the Acoustical Society of America, 145(5), 2982-2993.
Abstract: For normal-hearing (NH) listeners, interaural information in both temporal envelope and temporal fine structure contribute to binaural unmasking of target signals in background noise; however, in many conditions low-frequency interaural information in temporal fine structure produces greater binaural unmasking. For bilateral cochlear-implant (CI) listeners, interaural information in temporal envelope contributes to binaural unmasking; however, the effect of encoding temporal fine structure information in electrical pulse timing (PT) is not fully understood. In this study, diotic and dichotic signal detection thresholds were measured in CI listeners using bilaterally synchronized single-electrode stimulation for conditions in which the temporal envelope was presented without temporal fine structure encoded (constant-rate pulses) or with temporal fine structure encoded (pulses timed to peaks of the temporal fine structure). CI listeners showed greater binaural unmasking at 125 pps with temporal fine structure encoded than without. There was no significant effect of encoding temporal fine structure at 250 pps. A similar pattern of performance was shown by NH listeners presented with acoustic pulse trains designed to simulate CI stimulation. The results suggest a trade-off across low rates between interaural information obtained from temporal envelope and that obtained from temporal fine structure encoded in PT.

Specifying the perceptual relevance of onset transients for musical instrument identification

Advisor: Clara Hollomey, PhD
Publication: Siedenburg, K. (2019). Specifying the perceptual relevance of onset transients for musical instrument identification. The Journal of the Acoustical Society of America, 145(2), 1078-1087.
Abstract: Sound onsets are commonly considered to play a privileged role in the identification of musical instruments, but the underlying acoustic features remain unclear. By using sounds resynthesized with and without rapidly varying transients (not to be confused with the onset as a whole), this study set out to specify precisely the role of transients and quasi-stationary components in the perception of musical instrument sounds. In experiment 1, listeners were trained to identify ten instruments from 250 ms sounds. In a subsequent test phase, listeners identified instruments from 64 ms segments of sounds presented with or without transient components, either taken from the onset, or from the middle portion of the sounds. The omission of transient components at the onset impaired overall identification accuracy only by 6%, even though experiment 2 suggested that their omission was discriminable. Shifting the position of the gate from the onset to the middle portion of the tone impaired overall identification accuracy by 25%. Taken together, these findings confirm the prominent status of onsets in musical instrument identification, but suggest that rapidly varying transients are less indicative of instrument identity compared to the relatively slow buildup of sinusoidal components during onsets.

Integrating a remote microphone with hearing-aid processing

Advisor: Clara Hollomey, PhD

Kates, J.M., Arehart, K.H., Harvey, L.O. (2019). Integrating a remote microphone with hearing-aid processing. The Journal of the Acoustical Society of America, 145(6), 3551-3566.

Abstract: A remote microphone (RM) links a talker’s microphone to a listener’s hearing aids (HAs). The RM improves intelligibility in noise and reverberation, but the binaural cues necessary for externalization are lost. Augmenting the RM signal with synthesized binaural cues and early reflections enhances externalization, but interactions of the RM signal with the HA processing could reduce its effectiveness. These potential interactions were evaluated using RM plus HA processing in a realistic listening simulation. The HA input was the RM alone, the augmented RM signal, the acoustic inputs at the HA microphones, including reverberation measured using a dummy head, or a mixture of the augmented RM and acoustic input signals. The HA simulation implemented linear amplification or independent dynamic-range compression at the two ears and incorporated the acoustic effects of vented earmolds. Hearing-impaired listeners scored sentence stimuli for intelligibility and rated clarity, overall quality, externalization, and apparent source width. Using the RM improved intelligibility but reduced the spatial impression. Increasing the vent diameter reduced clarity and increased the spatial impression. Listener ratings reflect a trade-off between the attributes of clarity and overall quality and the attributes of externalization and source width that can be explained using the interaural cross correlation.

The processing and perception of size information in speech sounds

Advisor: Clara Hollomey, PhD

Smith, D.R.R., Patterson, R.D., Turner, R. (2005). The processing and perception of size information in speech sounds. The Journal of the Acoustical Society of America, 117(1), 305-318.

Abstract: There is information in speech sounds about the length of the vocal tract; specifically, as a child grows, the resonators in the vocal tract grow and the formant frequencies of the vowels decrease. It has been hypothesized that the auditory system applies a scale transform to all sounds to segregate size information from resonator shape information, and thereby enhance both size perception and speech recognition [Irino and Patterson, Speech Commun. 36, 181–203 (2002)]. This paper describes size discrimination experiments and vowel recognition experiments designed to provide evidence for an auditory scaling mechanism. Vowels were scaled to represent people with vocal tracts much longer and shorter than normal, and with pitches much higher and lower than normal. The results of the discrimination experiments show that listeners can make fine judgments about the relative size of speakers, and they can do so for vowels scaled well beyond the normal range. Similarly, the recognition experiments show good performance for vowels in the normal range, and for vowels scaled well beyond the normal range of experience. Together, the experiments support the hypothesis that the auditory system automatically normalizes for the size information in communication sounds.

A deep learning algorithm to increase intelligibility for hearing-impaired listeners in the presence of a competing talker and reverberation

Advisor: Han Li

Healy, E.W., Delfarah, M., Johnson, E.M., Wang, D. (2019). A deep learning algorithm to increase intelligibility for hearing-impaired listeners in the presence of a competing talker and reverberation. The Journal of the Acoustical Society of America, 145(3), 1378-1388.

Abstract: For deep learning based speech segregation to have translational significance as a noise-reduction tool, it must perform in a wide variety of acoustic environments. In the current study, performance was examined when target speech was subjected to interference from a single talker and room reverberation. Conditions were compared in which an algorithm was trained to remove both reverberation and interfering speech, or only interfering speech. A recurrent neural network incorporating bidirectional long short-term memory was trained to estimate the ideal ratio mask corresponding to target speech. Substantial intelligibility improvements were found for hearing-impaired (HI) and normal-hearing (NH) listeners across a range of target-to-interferer ratios (TIRs). HI listeners performed better with reverberation removed, whereas NH listeners demonstrated no difference. Algorithm benefit averaged 56 percentage points for the HI listeners at the least-favorable TIR, allowing these listeners to perform numerically better than young NH listeners without processing. The current study highlights the difficulty associated with perceiving speech in reverberant-noisy environments, and it extends the range of environments in which deep learning based speech segregation can be effectively applied. This increasingly wide array of environments includes not only a variety of background noises and interfering speech, but also room reverberation.

Divide and Conquer: A Deep CASA Approach to Talker-independent Monaural Speaker Separation

Advisor: Han Li

Liu, Y., and Wang, D.L. (2019). Divide and Conquer: A Deep CASA Approach to Talker-independent Monaural Speaker Separation. arXiv preprint arXiv:1904.11148.

Abstract: We address talker-independent monaural speaker separation from the perspectives of deep learning and computational auditory scene analysis (CASA). Specifically, we decompose the multi-speaker separation task into the stages of simultaneous grouping and sequential grouping. Simultaneous grouping is first performed in each time frame by separating the spectra of different speakers with a permutation-invariantly trained neural network. In the second stage, the frame-level separated spectra are sequentially grouped to different speakers by a clustering network. The proposed deep CASA approach optimizes frame-level separation and speaker tracking in turn, and produces excellent results for both objectives. Experimental results on the benchmark WSJ0-2mix database show that the new approach achieves the state-of-the-art results with a modest model size.

Measuring Speech Recognition With a Matrix Test Using Synthetic Speech

Advisor: Ľuboš Hládek, PhD

Nuesse, T., Wiercinski, B., Brand, T., Holube, I. (2019). Measuring Speech Recognition With a Matrix Test Using Synthetic Speech. Trends in Hearing, 23, 1-14;
doi: 10.1177/2331216519862982

Abstract: Speech audiometry is an essential part of audiological diagnostics and clinical measurements. Development times of speech recognition tests are rather long, depending on the size of speech corpus and optimization necessity. The aim of this study was to examine whether this development effort could be reduced by using synthetic speech in speech audiometry, especially in a matrix test for speech recognition. For this purpose, the speech material of the German matrix test was replicated using a preselected commercial system to generate the synthetic speech files. In contrast to the conventional matrix test, no level adjustments or optimization tests were performed while producing the synthetic speech material. Evaluation measurements were conducted by presenting both versions of the German matrix test (with natural or synthetic speech), alternately and at three different signal-to-noise ratios, to 48 young, normal-hearing participants. Psychometric functions were fitted to the empirical data. Speech recognition thresholds were 0.5 dB signal-to-noise ratio higher (worse) for the synthetic speech, while slopes were equal for both speech types. Nevertheless, speech recognition scores were comparable with the literature and the threshold difference lay within the same range as recordings of two different natural speakers. Although no optimization was applied, the synthetic-speech signals led to equivalent recognition of the different test lists and word categories. The outcomes of this study indicate that the application of synthetic speech in speech recognition tests could considerably reduce the development costs and evaluation time. This offers the opportunity to increase the speech corpus for speech recognition tests with acceptable effort.

Influence of Multi-microphone Signal Enhancement Algorithms on the Acoustics and Detectability of Angular and Radial Source Movements

Advisor: Ľuboš Hládek, PhD

Lundbeck, M., Hartog, L., Grimm, G., Hohmann, V., Bramsløw, L., Neher, T. (2018). Influence of Multi-microphone Signal Enhancement Algorithms on the Acoustics and Detectability of Angular and Radial Source Movements, Trends in Hearing, 22, 1-13;
doi: 10.1177/2331216518779719

Abstract: Hearing-impaired listeners are known to have difficulties not only with understanding speech in noise but also with judging source distance and movement, and these deficits are related to perceived handicap. It is possible that the perception of spatially dynamic sounds can be improved with hearing aids (HAs), but so far this has not been investigated. In a previous study, older hearing-impaired listeners showed poorer detectability for virtual left-right (angular) and near-far (radial) source movements due to lateral interfering sounds and reverberation, respectively. In the current study, potential ways of improving these deficits with HAs were explored. Using stimuli very similar to before, detailed acoustic analyses were carried out to examine the influence of different HA algorithms for suppressing noise and reverberation on the acoustic cues previously shown to be associated with source movement detectability. For an algorithm that combined unilateral directional microphones with binaural coherence-based noise reduction and for a bilateral beamformer with binaural cue preservation, movement-induced changes in spectral coloration, signal-to-noise ratio, and direct-to-reverberant energy ratio were greater compared with no HA processing. To evaluate these two algorithms perceptually, aided measurements of angular and radial source movement detectability were performed with 20 older hearing-impaired listeners. The analyses showed that, in the presence of concurrent interfering sounds and reverberation, the bilateral beamformer could restore source movement detectability in both spatial dimensions, whereas the other algorithm only improved detectability in the near-far dimension. Together, these results provide a basis for improving the detectability of spatially dynamic sounds with HAs.

Visual Speech Benefit in Clear and Degraded Speech Depends on the Auditory Intelligibility of the Talker and the Number of Background Talkers

Advisor: Ľuboš Hládek, PhD

Blackburn, C.L., Kitterick, P.T., Jones, G., Sumner, C.J., Stacey, P.C. (2019). Visual Speech Benefit in Clear and Degraded Speech Depends on the Auditory Intelligibility of the Talker and the Number of Background Talkers, Trends in Hearing, 23, 1-14;
doi: 10.1177/2331216519837866

Abstract: Perceiving speech in background noise presents a significant challenge to listeners. Intelligibility can be improved by seeing the face of a talker. This is of particular value to hearing impaired people and users of cochlear implants. It is well known that auditory-only speech understanding depends on factors beyond audibility. How these factors impact on the audio-visual integration of speech is poorly understood. We investigated audio-visual integration when either the interfering background speech (Experiment 1) or intelligibility of the target talkers (Experiment 2) was manipulated. Clear speech was also contrasted with sine-wave vocoded speech to mimic the loss of temporal fine structure with a cochlear implant. Experiment 1 showed that for clear speech, the visual speech benefit was unaffected by the number of background talkers. For vocoded speech, a larger benefit was found when there was only one background talker. Experiment 2 showed that visual speech benefit depended upon the audio intelligibility of the talker and increased as intelligibility decreased. Degrading the speech by vocoding resulted in even greater benefit from visual speech information. A single “independent noise” signal detection theory model predicted the overall visual speech benefit in some conditions but could not predict the different levels of benefit across variations in the background or target talkers. This suggests that, similar to audio-only speech intelligibility, the integration of audio-visual speech cues may be functionally dependent on factors other than audibility and task difficulty, and that clinicians and researchers should carefully consider the characteristics of their stimuli when assessing audio-visual integration.

An Extended Binaural Real-Time Auralization System With an Interface to Research Hearing Aids for Experiments on Subjects With Hearing Loss

Advisor: Ľuboš Hládek, PhD

Pausch, F., Aspöck, L., Vorländer, M., Fels, J. (2018). An Extended Binaural Real-Time Auralization System With an Interface to Research Hearing Aids for Experiments on Subjects With Hearing Loss, Trends in Hearing, 22, 1-32;
doi: 10.1177/2331216518800871.


Abstract: Theory and implementation of acoustic virtual reality have matured and become a powerful tool for the simulation of entirely controllable virtual acoustic environments. Such virtual acoustic environments are relevant for various types of auditory experiments on subjects with normal hearing, facilitating flexible virtual scene generation and manipulation. When it comes to expanding the investigation group to subjects with hearing loss, choosing a reproduction system which offers a proper integration of hearing aids into the virtual acoustic scene is crucial. Current loudspeaker-based spatial audio reproduction systems rely on different techniques to synthesize a surrounding sound field, providing various possibilities for adaptation and extension to allow applications in the field of hearing aid-related research. Representing one option, the concept and implementation of an extended binaural real-time auralization system is presented here. This system is capable of generating complex virtual acoustic environments, including room acoustic simulations, which are reproduced as combined via loudspeakers and research hearing aids. An objective evaluation covers the investigation of different system components, a simulation benchmark analysis for assessing the processing performance, and end-to-end latency measurements.

Effect of Noise Reduction Gain Errors on Simulated Cochlear Implant Speech Intelligibility

Advisor: Dipl.-Ing. Matthieu Kuntz

Kressner, A.A., May, T., Dau, T. (2019). Effect of Noise Reduction Gain Errors on Simulated Cochlear Implant Speech Intelligibility, Trends in Hearing, 23, 1-12;
doi: 10.1177/2331216519825930.


Abstract: It has been suggested that the most important factor for obtaining high speech intelligibility in noise with cochlear implant (CI) recipients is to preserve the low-frequency amplitude modulations of speech across time and frequency by, for example, minimizing the amount of noise in the gaps between speech segments. In contrast, it has also been argued that the transient parts of the speech signal, such as speech onsets, provide the most important information for speech intelligibility. The present study investigated the relative impact of these two factors on the potential benefit of noise reduction for CI recipients by systematically introducing noise estimation errors within speech segments, speech gaps, and the transitions between them. The introduction of these noise estimation errors directly induces errors in the noise reduction gains within each of these regions. Speech intelligibility in both stationary and modulated noise was then measured using a CI simulation tested on normal-hearing listeners. The results suggest that minimizing noise in the speech gaps can improve intelligibility, at least in modulated noise. However, significantly larger improvements were obtained when both the noise in the gaps was minimized and the speech transients were preserved. These results imply that the ability to identify the boundaries between speech segments and speech gaps may be one of the most important factors for a noise reduction algorithm because knowing the boundaries makes it possible to minimize the noise in the gaps as well as enhance the low-frequency amplitude modulations of the speech.

Machine-learning-based estimation and rendering of scattering in virtual reality

Advisor: Dipl.-Ing. Matthieu Kuntz

Pulkki, V. and Svensson, P. (2019). Machine-learning-based estimation and rendering of scattering in virtual reality, The Journal of the Acoustical Society of America, 145(4), 2664-2676.

Abstract: In this work, a technique to render the acoustic effect of scattering from finite objects in virtual reality is proposed, which aims to provide a perceptually plausible response for the listener, rather than a physically accurate response. The effect is implemented using parametric filter structures and the parameters for the filters are estimated using artificial neural networks. The networks may be trained with modeled or measured data. The input data consist of a set of geometric features describing a large quantity of source-object-receiver configurations, and the target data consist of the filter parameters computed using measured or modeled data. A proof-of-concept implementation is presented, where the geometric descriptions and computationally modeled responses of three-dimensional plate objects are used for training. In a dynamic test scenario, with a single source and plate, the approach is shown to provide a similar spectrogram when compared with a reference case, although some spectral differences remain present. Nevertheless, it is shown with a perceptual test that the technique produces only a slightly lower degree of plausibility than the state-of-the-art acoustic scattering model that accounts for diffraction, and also that the proposed technique yields a prominently higher degree of plausibility than a model that omits diffraction.


Registration for the seminar is done via TUMOnline.