Modern systems for communication and information processing are prerequisites for high-quality interpersonal communication and information exchange. Moreover, these systems enable us to interact with all kinds of computers and computer-controlled machines, e.g. to operate entertainment electronics, to access the internet, to use information services, or even to navigate a car. With ongoing technological progress, these systems not only become more capable and efficient, but also more complex. Nowadays they have become a part of everyday life, in contrast to former times, when only engineers and experts had to operate them. For this reason, an adequate and efficient user interface is a major goal of research and development, enabling everyone to participate in modern communication infrastructure and technology.
The research topics at the Institute for Human-Machine Communication deal with the fundamentals of a widely intuitive, natural, and therefore multimodal interaction between humans and complex information processing systems. All forms of interaction, i.e. modalities, that are available to humans are to be investigated for this purpose. Both the machine's representation of information and the interaction technique are to be considered in this context.
The improvement of a single method of interaction is important, but not our main research goal. The interplay of different modalities is most promising for enhancing the efficiency of Human-Machine Communication. Even a combination of only two modalities, e.g. speech and haptics, can be more efficient and less error-prone than a single mode of interaction.
To ensure the end user's acceptance of these new forms of Human-Machine Communication, it is of prime interest to investigate the usability of new interfaces at an early stage of development. This usability engineering process yields numerous hints that enable developers to produce a more efficient and less cryptic dialog between humans and machines.
Since the user's knowledge about a system and its properties changes over time and differs between individuals, it is advisable to create adaptive user interfaces. Therefore, our research also investigates the foundations of adaptivity and learning systems, enabling us to develop man-machine interfaces that truly take the user's varying skills into account.
Human-Machine Communication is an interdisciplinary field of research. Therefore, many different subjects are involved in reaching the long-term research objective of a natural, intuitive way of interacting with "machines". This chapter gives an overview of the current research topics at the Institute for Human-Machine Communication, but cannot be exhaustive. Please refer to the list of scientific publications in section 8.2 for a complete view of our research work during the four-year period of this report.
Claus von Rücker
Human beings process several interfering perceptions at a high level of abstraction, allowing them to meet the demands of the prevailing situation. Most of today's technical systems are not yet capable of emulating this ability. Another problem of current systems is that, due to their growing functionality, their interfaces are often complex to handle and demand a high degree of adaptation from the user.
However, interfaces would be particularly desirable whose handling can be learned in a short time and that can be operated quickly, easily and, above all, intuitively. We therefore advocate the design of multimodal system interfaces, as they provide the user with greater naturalness, expressive power, and flexibility. Moreover, multimodal systems are likely to operate more robustly than their unimodal counterparts because they integrate redundant information shared between the individual input modalities.
We are especially interested in the design of a generic multimodal system architecture which can easily be adapted to various conditions and applications and thus serve as a basis for multimodal interaction systems. In contrast to most existing architectures, our design philosophy is to merge several competing modalities (i.e. two or more speech modules, dynamic and static gesture modules, etc.) instead of using a specially designed set of combined modalities.
To integrate the information contents of the individual input modalities, we are following a new approach. Employing biologically motivated evolutionary strategies, the core algorithm comprises a population of individual solutions to the problem at hand and a set of operators defined over the population itself. According to evolutionary theory, only the fittest elements in a population are likely to survive and generate offspring, transmitting their heredity to new generations and thus leading to stable and robust solutions.
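The selection cycle described above can be illustrated as a minimal (mu + lambda) evolution strategy. Note that the fitness function below is a toy stand-in; the actual scoring of competing multimodal hypotheses is application-specific and not reproduced here.

```python
import random

def fitness(candidate, target):
    # Lower is better: squared distance to an (assumed) consistent interpretation.
    return sum((c - t) ** 2 for c, t in zip(candidate, target))

def evolve(target, dim=4, mu=5, lam=20, generations=50, sigma=0.3, seed=1):
    """(mu + lambda) evolution strategy with Gaussian mutation and elitist survival."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(mu)]
    for _ in range(generations):
        # Each offspring is a mutated copy of a randomly chosen parent.
        offspring = [[g + rng.gauss(0.0, sigma) for g in rng.choice(population)]
                     for _ in range(lam)]
        # Only the fittest individuals survive into the next generation.
        population = sorted(population + offspring,
                            key=lambda c: fitness(c, target))[:mu]
    return population[0]

target = [0.5, -0.2, 0.1, 0.9]
best = evolve(target)
print("residual error:", fitness(best, target))
```

In the actual architecture, an individual would encode one joint interpretation of all modality hypotheses rather than a plain numeric vector.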
The concepts of the developed multimodal system architecture are validated in various scenarios, both in research and in industry-sponsored projects. One special application, for example, enables the user to navigate arbitrary virtual worlds by freely combining natural and command speech, dynamic hand gestures, and conventional graphical interfaces.
Due to the multiplicity of coexisting electronic devices in modern luxury and upper-class automobiles, the limit of usability has been reached for the average user. Examples of these devices are navigation and telematics systems, audio and video components such as CD changer, radio and television, mobile phone, car computer, air conditioning, and any additional conceivable unit such as internet applications. As the devices of different manufacturers not only look different, but also have company-specific user interfaces and control elements, they can hardly be operated by an average user - particularly while driving.
For some years now, car manufacturers have been trying to solve this problem by developing multi-functional integrated man-machine interfaces (MMI), which employ a common graphical display and a reduced number of haptic control elements.
Reducing the number of control elements, however, increases the complexity of the menu structures. The user still has to read lengthy manuals to find his way through the menus of the resulting user interface. In order to make operating the MMI more intuitive, alternative basic approaches and studies are essential.
We consider the introduction of familiar human communication modalities like natural speech and gestures a precondition for an intuitive dialog with machines, which also applies to car MMIs. Additionally, the user should receive further support from an adaptive help (cf. section 4.3.2) and assistance system (cf. section 4.3.1). However, further problems appear.
How does such an integrated user interface have to be structured? What kind of modalities can or must be used for which function? How and when should acoustic or visual feedback be given? How and when can the system sensibly adapt to the user? How can the user interface be designed to be user-friendly and intuitively usable, assisting (but not dominating) the user, all without drawing the driver's attention away from traffic?
Our research in this domain mainly takes place in our navigation lab (see 2.2), a usability lab (see 2.1) specially fitted out for the automotive context, which permits us to run usability tests in an automotive environment. The MMI is simulated by a computer. Using the "Wizard-of-Oz" methodology, we test new concepts in a kind of rapid prototyping, with a "wizard" observing the test person and controlling the MMI. The test person gets the impression of controlling the MMI with gestures, speech, or haptics. Promising concepts are then transferred to and implemented in a real system.
Electronic acquisition of mathematical formulas via conventional tools is a time-consuming and complicated task. Therefore, a soft-decision solution for online handwritten formula recognition was successfully demonstrated in a former project [97win2]. Recently we developed a novel approach towards a multimodal analysis of natural speech and handwriting interaction, these being the fastest and most intuitive channels for entering mathematical expressions into a computer [00hun1]. We utilize an integrated, multilevel probabilistic architecture with a joint semantic model and two distinct syntactic models describing speech and script properties, respectively. Basic arithmetic operations, roots, indexed sums, integrals, trigonometric functions, logarithms, convolutions, Fourier transforms, exponentiations, and indexing (among others) are supported. Compared to classical multistage solutions, our single-stage strategy benefits from an implicit transfer of higher-level contextual information into the lower-level segmentation and pattern recognition processes involved. For visualization and postprocessing purposes, a transformation into Adobe™ FrameMaker™ documents is performed.
The syntactic-semantic attributes of spoken and handwritten mathematical formulas are represented by the parameters of a so-called Multimodal Probabilistic Grammar. It combines properties of context-free phrase structure grammars with those of graph grammars by allowing for word-type, symbol-type, and position-type terminals.
The grammar is implemented in a single-stage semantic decoder by means of a compact semantic representation called Semantic Structure S [99mue1]. It is given by a hierarchically structured combination of Semuns s (semantic units) from a predefined inventory, with corresponding type, value, and successor attributes, every unit referring to a certain mathematical operator or operand. On the syntactic level, every Semun of a given semantic hypothesis is assigned to a so-called Syntactic Module (SM). This consists of an advanced transition network which enables two distinct stochastic processes: 1) transitions from one node to another, 2) emissions of spoken words, handwritten symbols, or local offsets between associated symbols or symbol groups [00hun2]. Transitions are responsible for modelling speaking and writing order, whereas emissions account for the varying choice of words, symbols, or positions, respectively. All the necessary transition and emission probabilities, as well as semantic type, value, and successor probabilities, were estimated from training corpora obtained from separate speech and handwriting usability tests.
An extended Earley-type top-down chart parser performs a MAP (maximum a-posteriori) classification across all abstraction levels. On the levels close to the signal, the preprocessed speech or handwriting input sequences are rated using 30-dimensional phoneme-based semi-continuous HMMs (Hidden Markov Models) or 7-dimensional DTW (Dynamic Time Warping) matching, respectively. In a one-pass search algorithm, all possible semantic hypotheses are successively tested until the best overall semantic representation of a given input is found. Due to a breadth-first search strategy, incremental first-to-last processing is enabled.
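The DTW matching mentioned above can be sketched as follows. The two-dimensional features and the Euclidean local cost are illustrative choices, not the 7-dimensional handwriting features used in the project.

```python
# Minimal dynamic-time-warping distance between two feature sequences.
def dtw_distance(a, b):
    """Classic DTW with a Euclidean local cost, O(len(a) * len(b))."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]

# A stored template and a slightly warped observation of the "same" symbol.
template = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
observed = [(0.1, 0.0), (0.9, 1.1), (1.0, 1.0), (2.1, 0.1)]
print(dtw_distance(template, observed))
```

The warping lets the observed sequence be a different length from the template while still accumulating only small local costs.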
The most significant advantage of the single-stage semantic decoding architecture used in this work results from the simultaneous evaluation of knowledge belonging to all the abstraction levels involved: apart from self-focusing effects achieved by restricting the search process to locally consistent subhypotheses only, semantically corrupt recognition results are prevented by the integrated expectation-driven classification scheme.
The overall system architecture is sketched in fig. 6.
For evaluation purposes we performed independent test classifications in either modality. Fully spoken or fully handwritten realistic formulas were examined, yielding a structural recognition accuracy of 61.1 % for speech and 83.3 % for handwriting (note: these numbers refer to complete formula correctness). In the future we wish to support freely interleaved speech and handwriting interactions, including mutual coreferencing by deictic wording and pen gestures. To this end, the use of speech will presumably be focused on subterm input and error corrections, so that we anticipate robust and approximately real-time performance of the forthcoming system. Via a back-and-forth transformation to the FrameMaker™ formula editor, natural speech and handwriting interaction may also be complemented by conventional input modes in the future.
The FERMUS project was started in March 2000 in cooperation with the industry partners BMW AG, DaimlerChrysler AG, Siemens AG, and Mannesmann VDO AG. Its primary intention is to localize and evaluate various strategies for analyzing errors in information systems operated through various, mainly recognition-based, modalities.
A special goal is to investigate how the robustness of technical systems can be enhanced by using multimodal information already at the recognition level. By developing dedicated dialogue techniques and special strategies for adapting the system to the current situation and to the intention of the user, we expect an additional increase in system functionality through dynamic specification. Moreover, the influence of emotions, especially with regard to stress situations, is researched intensively to enable a reliable separation and - in some cases - transformation of unusable into usable information.
The primary test domain is the handling of diverse communication facilities in an upper-class automobile (such as radio/CD, telephone, internet, etc.) in connection with the variety of potential error sources from internal and external interfering side effects. In this context we are especially interested in the modality-specific impacts on the performance of the overall system. For the examinations, a simplified man-machine interface facilitating the operation of basic information and communication devices is used.
The project builds on the results of previously completed projects with various industry partners (mainly BMW AG and Siemens AG); this holds particularly for the examination results concerning adaptation mechanisms as well as usability tests with regard to multimodal communication.
Gestures and facial expressions are important components of interpersonal communication. Using these visual modalities, human-machine dialog, too, can be given a more natural and intuitive form. Here we exclusively consider vision-based methods in order to achieve non-intrusive gesture and facial expression recognition. The major topics in this field of research are the specification, implementation, and evaluation of the human-machine interface and the development of image-based methods for automatic gesture recognition. It has turned out that the user's gestures can be massively influenced by the graphical design of the visual interface; it is therefore very important to coordinate the menu-driven handling and the visual feedback in order to create a gesture-optimized application. During the research and development process, frequent usability tests have been carried out, for which we principally use the "Wizard-of-Oz" methodology (cf. section 4.1.2). Most human gestures are movements; therefore, a lot of information is conveyed by motion.
The process of automatic gesture recognition is described in the following; fig. 7 shows the system overview. The system does not need any model of the target object and can thus easily be transferred to other objects or domains. The object in question (e.g. hand or face) is separated from the background by spatial segmentation. This is done by color segmentation in an uncluttered environment with defined lighting, and by a combination of low-level image processing algorithms with object tracking in cluttered scenes. The movement of the segmented object is then classified with stochastic models. Hidden Markov Models (HMMs) are used for this, as they can reproduce non-stationary temporal processes. The central problem with this model is to find suitable features for transforming the spatio-temporal image sequence (fig. 8, left) into a time sequence of feature vectors (fig. 8, right). Different features, extracted from the object area or the object contour, have been implemented and tested. A further problem is to separate the continuous video stream of the observation camera into meaningful sections and non-meaningful sections, such as coincidental movements or pauses. Temporal segmentation is done either with a fast but less robust two-level approach or with a robust but computationally expensive single-level approach (HMM-based spotting). The recognition methods developed in this project allow - depending on the features used - segmentation and classification of each type of movement. Feature extraction methods have thus been found which are suitable for recognizing dynamic human gestures and facial expressions. A real-time demonstration system has been implemented, the complex functionality of which can be controlled exclusively with dynamic gestures [98mor1, 98mor2, 98mor3, 99mor1, P-L13].
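As a sketch of the classification step, each gesture class can be given its own HMM, and an observed feature sequence is assigned to the class whose model yields the highest likelihood (forward algorithm). The discrete emission symbols and all model parameters below are invented for illustration; the real system works on continuous feature vectors.

```python
def forward_likelihood(obs, start, trans, emit):
    """Forward algorithm: P(obs | model), summed over all state paths."""
    alpha = [start[s] * emit[s][obs[0]] for s in range(len(start))]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(len(alpha))) * emit[s][o]
                 for s in range(len(start))]
    return sum(alpha)

# Two toy 2-state left-to-right models: "wave" prefers symbol 0 early,
# "circle" prefers symbol 1 early (hypothetical gesture names).
models = {
    "wave":   dict(start=[1.0, 0.0],
                   trans=[[0.6, 0.4], [0.0, 1.0]],
                   emit=[[0.8, 0.2], [0.3, 0.7]]),
    "circle": dict(start=[1.0, 0.0],
                   trans=[[0.6, 0.4], [0.0, 1.0]],
                   emit=[[0.2, 0.8], [0.7, 0.3]]),
}

def classify(obs):
    # Maximum-likelihood decision over the per-gesture models.
    return max(models, key=lambda g: forward_likelihood(obs, **models[g]))

print(classify([0, 0, 1, 1]))  # early 0s then 1s fit the "wave" model best
```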
Gesture recognition has been extended to other domains, such as controlling electronic car devices (see ADVIA project, section 4.1.2) and navigating VRML worlds (see section 4.1.1), with specifically adapted, usability-tested visual interfaces and gesture vocabularies. In order to permit a more intuitive handling of an application, better fitted to humans, the gesture recognition system has to be modified and improved. Currently, a gesture is defined as a hand movement with a defined start and end position. The response of an application takes place after the recognition of such a completed sequence. Analysis of everyday situations shows that, in most cases, this form of indirect manipulation is how gestures are used. But in some cases it is reasonable to analyze and process the visual input directly while detecting a directional hand movement, for example when setting analog quantities or moving objects to a precise point on the screen. The main problem here is to make the recognition system distinguish automatically between direct and indirect control modes. Further information sources in gestures are speed, amplitude, and repetition frequency, which are not yet specifically analyzed by the system. In addition to carrying important information about the amount by which the user wants to change a parameter, these features carry information about the user's habits and emotional state.
The information sources which need to be exploited for user adaptation strongly depend on the goal of adaptation and on the application itself. Tutoring systems, for example, provide some kind of guidance or training on a special topic; the user's goal is obvious from the fact that he uses the tutoring system. Other applications allow a variety of different intentions, and the complexity of these applications does not allow direct inference of the user intention. However, knowing the user intention is the key to appropriate adaptive dialog and system modelling. To estimate the user's goal, we developed two different approaches, both based on probabilistic networks, which provide methods of dealing with uncertain and incomplete information.
A plan recognizer is used to infer the user intention by considering recent user actions. Depending on the application, the user's plans often differ greatly from optimal plans, i.e. users behave suboptimally. Our plan-based approach has been optimized for such applications, considering that even suboptimal behavior may support plan or intention hypotheses to a certain extent. Trying to determine the user intention from merely a small number of actions obviously entails a risk of failure, i.e. of not estimating the real user intention. To reduce the risk that the system does not offer help or assistance for the real user intention, a help system was developed that is capable of creating help texts according to a number of nearly equally likely plans without overstraining the user's cognitive capacities.
The second approach has been developed for scenarios and applications with varying situations entailing changes of the user's preferences and intentions. In a car, for example, the driver's intentions and preferences strongly depend on location, weather, speed, traffic, and passengers. Therefore a number of different probabilistic expert systems based on probabilistic networks have been created which are capable of learning typical user behavior and intentions according to the current situation. Real-time online training of the influence of each situation parameter on the user behavior allows the expert system to infer the user intention in similar situations. As a result, reasoning about a certain situation does not require the expert system to have been trained on that exact situation.
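A minimal probabilistic sketch of situation-dependent intention estimation: a situation selects a prior over intentions, and an observed action updates it via Bayes' rule. The situations, intentions, actions, and probabilities below are all hypothetical.

```python
def posterior(prior, likelihoods, action):
    """P(intention | situation, action) via Bayes' rule with normalization."""
    unnorm = {i: prior[i] * likelihoods[i][action] for i in prior}
    z = sum(unnorm.values())
    return {i: p / z for i, p in unnorm.items()}

# Situation-dependent priors, e.g. learned from observed driver behavior.
priors = {
    "raining": {"close_window": 0.6, "call_home": 0.4},
    "sunny":   {"close_window": 0.2, "call_home": 0.8},
}
# How likely each intention is to produce an observed action.
likelihoods = {
    "close_window": {"touch_window_button": 0.9, "touch_phone": 0.1},
    "call_home":    {"touch_window_button": 0.1, "touch_phone": 0.9},
}

p = posterior(priors["raining"], likelihoods, "touch_window_button")
print(max(p, key=p.get))
```

In the actual expert systems, many situation parameters influence the prior jointly; the two-situation table here only illustrates the update mechanism.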
Having an estimate of the user intention offers a lot of potential for enhancing the human-machine dialog. For example, knowing the user intention helps to reduce system requests and to formulate them with regard to the situation and the user's goal. Additionally, we use the estimate of the user intention not only for user adaptation, but also for a user- and situation-specific evaluation of pattern-recognition-based inputs to improve recognition rates.
In order to facilitate the interaction with modern software systems, an adaptation to the user is to be achieved. For this, the system must be capable of gathering and evaluating information about the user. Fundamental methods and approaches to the adaptation of a software system to a user can best be developed and demonstrated within the area of "intelligent help and tutoring systems". The central aspect is the modelling of the user's knowledge status. We examine both statistical and rule-based methods for learning user models, which are to enable an independent, user-adequate adaptation of the help or tutoring system. Essentially, the architecture underlying the system consists of a combination of neural networks and fuzzy logic, also known as "neuro-fuzzy and soft computing".
In the first phase of system creation (training) we use the information gained from user actions, which carries a degree of uncertainty (preprocessing), in order to generate and train an easily modifiable neural network. Depending on the application, a generalized regression network, a probabilistic neural network, a competitive neural network, or a self-organizing map is taken as a basis. In order to model the different characteristics of users optimally, a combination of several networks can also be used. In the usage phase it is then possible with this network to classify a user into different categories and give him appropriate assistance embedded in the context. With the user actions evaluated in this phase we re-train the network online. Thus a gradually increasing adaptation of the system to the individual user is achieved.
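The classify-then-retrain loop can be sketched with a simple competitive (nearest-prototype) scheme: the closest prototype determines the user category, and the winner is afterwards pulled toward the new observation. The categories, features, and values below are hypothetical stand-ins for the networks actually used.

```python
def nearest(prototypes, x):
    """Return the category whose prototype is closest to observation x."""
    return min(prototypes,
               key=lambda c: sum((p - v) ** 2 for p, v in zip(prototypes[c], x)))

def online_update(prototypes, x, lr=0.2):
    """Competitive learning: move only the winning prototype toward x."""
    winner = nearest(prototypes, x)
    prototypes[winner] = [p + lr * (v - p) for p, v in zip(prototypes[winner], x)]
    return winner

# Features: (error rate, mean action duration), both normalized to [0, 1].
prototypes = {"novice": [0.8, 0.8], "expert": [0.1, 0.2]}

# Online re-training on a stream of observed user actions.
for obs in [[0.15, 0.25], [0.1, 0.15], [0.05, 0.2]]:
    online_update(prototypes, obs)
print(nearest(prototypes, [0.1, 0.2]))
```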
We develop and test the fundamental architecture in different projects. As part of the ADVIA project (see 4.1.2), an adaptive help system is created to support the introduction to the gesture-controlled operation of an MMI in the vehicle. In order to infer the driver's need for help and knowledge status, we analyze in particular the execution quality of the gestures (confidence measure of the gesture recognition), the execution duration of the gestures, and the number of assistance requests, in each case as a function of the context. Our goal is to support the user with automatic but unobtrusive help in order to avoid or minimize the need for help requests. Beyond that, an expansion of the help system to the entire multimodal operation of the MMI is conceivable. Furthermore, we are developing a tutoring system for an introduction to creating HTML pages.
Nowadays it is well known that emotions play a fundamental role in perception, attention, reasoning, learning, memory, decision making, and other human abilities and mechanisms we generally associate with rational and intelligent behavior. In addition to verbal communication, nonverbal communication represents an important factor in enabling reasonable, purposeful, and efficient interaction between people. Besides these communicative functions, nonverbal behavior and especially facial expression can provide information about the current affective state of a person. Correspondingly, we cannot afford to neglect emotions any longer if we want to develop future computer systems capable of solving complex problems or of interacting with humans in an intelligent way, that is, to create human-like behavior in machines.
Numerous questions are raised in the context of emotions in human-machine communication: which emotions occur when humans interact with computer systems, how can they be recognized, which stimuli cause these emotions, in which way should a computer system respond to emotional reactions of the user, etc. Once these problems are solved, a computer system capable of handling emotions would register the signals sent out by the user, recognize the inherent patterns, and assimilate this data into a model of the user's emotional reactions. The system would then be able to transmit useful information about the user, for example to applications that can use such data.
Probably the largest application area will be future generations of human-machine interfaces, which will be able to recognize the emotional states of users and react to them adequately. For example, if a user becomes frustrated or annoyed while interacting with an application, the system might respond to these emotional states, preferably in such a way that the user senses it as intuitive. In this way, following the model of interaction between humans, it might be possible to achieve a more natural human-machine interaction.
By integrating emotions as an additional modality alongside haptics, speech, gesture, etc., it might be possible to improve the error robustness of the control of technical systems. Research on this topic will be done within the framework of the FERMUS project (as described in section 4.1.4). The operation of communication devices in automobiles, such as phone, radio/CD, and internet, will serve as the application area. Emotional states of the user will be used as a supplementary control element to achieve a better adaptation of the system to the current situation of the user. Within the studies it is of particular interest to detect in which cases system errors lead to negative emotional reactions of the user, and how it is possible to avoid or at least to weaken them.
In many applications of modern human-machine interaction it seems necessary to allow continuous speech as input and output of the system. The typical task of continuous speech recognition consists in evaluating the phonetic information of an utterance (e.g. a sentence) and representing the result as a chain of words which are defined in a lexicon. This task is usually carried out on the basis of single procedures or "modules" which perform the preprocessing of the acoustic speech signal, the application of phonetic units (phoneme models), the classification of words, the utilization of syntactic constraints, the recognition of the complete sentence, and the analysis of the semantic meaning of the spoken utterance with respect to a given, well-defined task.
Since speech sounds are characterized especially by their spectral properties, a preprocessing step calculates short-time spectra in time intervals of about 10 ms. Beyond that, it is advantageous to take the properties of the auditory system into account. For this purpose the Bark or mel scale should be chosen instead of a linear frequency axis. A basic model of human auditory frequency analysis can provide so-called "loudness spectra" which are especially suited for speech recognition.
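One widely used formula for the mel scale (a common choice in the literature, not necessarily the exact auditory model used here) maps a frequency f in Hz to 2595 log10(1 + f/700) mel:

```python
import math

def hz_to_mel(f_hz):
    # Common mel-scale formula; roughly linear below 1 kHz, logarithmic above.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Exact inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_centers(f_min, f_max, n):
    """Center frequencies of n filters, equally spaced on the mel axis."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    return [mel_to_hz(lo + k * (hi - lo) / (n + 1)) for k in range(1, n + 1)]

print([round(f) for f in mel_centers(0.0, 8000.0, 5)])
```

Note how the centers crowd together at low frequencies, mirroring the finer auditory resolution there.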
A fundamental problem is posed by coarticulation effects, which cause a strong mutual influence between neighboring speech sounds. As a result, the feature vectors can be strongly dependent on the phonetic context, so that a phoneme unit cannot be described by a single pattern. In our work we tried to use syllables or parts of syllables as decision units, namely syllabic consonant clusters and syllabic nuclei (vowels and vowel clusters). By using these clusters, the main coarticulation effects are contained within the units. As an alternative, so-called triphones can be introduced, which constitute each phoneme together with a specific left and right phoneme context.
The classification of phonemes is usually based on stochastic modeling by means of Hidden Markov Models (HMMs). These models consist of states within a left-to-right Markov graph, where each state contains the probability density function (pdf) of the feature vectors observed in that state. During the training step these distributions are determined from a large training set. Research work nowadays concentrates on fast estimation methods and the adaptation of the HMMs to new speakers and new environments.
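The decoding step for such a left-to-right model can be sketched with the Viterbi algorithm, which finds the single most probable state sequence for an observation sequence. The toy model below uses discrete emissions instead of the continuous pdfs described above, and all parameter values are invented.

```python
def viterbi(obs, start, trans, emit):
    """Return the most probable state sequence for obs under the given HMM."""
    n = len(start)
    delta = [start[s] * emit[s][obs[0]] for s in range(n)]
    back = []
    for o in obs[1:]:
        prev = delta
        # Best predecessor for each state, then the corresponding path score.
        back.append([max(range(n), key=lambda p: prev[p] * trans[p][s])
                     for s in range(n)])
        delta = [prev[back[-1][s]] * trans[back[-1][s]][s] * emit[s][o]
                 for s in range(n)]
    # Trace the best path backwards from the most probable final state.
    state = max(range(n), key=lambda s: delta[s])
    path = [state]
    for bp in reversed(back):
        state = bp[state]
        path.append(state)
    return list(reversed(path))

# 3-state left-to-right model with two discrete emission symbols.
start = [1.0, 0.0, 0.0]
trans = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
emit  = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]

print(viterbi([0, 1, 1, 0], start, trans, emit))
```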
It is rather attractive to combine the HMM approach with Neural Nets (NNs), yielding hybrid models. The pdfs within the HMMs can then be calculated by means of NNs. In this way it is possible to obtain a discriminant representation of the feature vector distributions. These NNs can also favorably be used for determining the syllable nuclei and for measuring the speaking rate.
It is important to allow for pronunciation variants, which may be dependent on the speaking style and the speaking rate. These variants can be determined by inspection of large speech corpora or on the basis of rules.
Speech understanding can be logically divided into a speech recognition phase and an interpretation phase. It is important to evaluate both tasks jointly. This means the search for the chain of spoken words and for the corresponding semantic structure is carried out simultaneously. This is facilitated if the semantic structure is also described by stochastic modeling techniques. In our work the semantic structure is built up from semantic units which contain the single parts of the semantic meaning. The semantic understanding module describes the syntactic and semantic structure within a probabilistic framework. Thus, recognition and understanding are able to yield a common optimal result.
The processing of spontaneously spoken human-to-human dialogues is a special challenge for automatic speech recognition systems. Compared to read speech, where the speaking mode is well defined, in spontaneous speech several sources of variability contribute to changes in the speech signal. Among others, the speaking rate is an important factor influencing the speech signal. Unlike human listeners, who can cope with a large range of speaking rates without any problem, state-of-the-art automatic speech recognition systems show severe degradations in recognition performance when the speaking rate is higher than normal. Therefore, several approaches to improve the robustness of hidden Markov model (HMM) based automatic speech recognition systems with respect to speaking rate were evaluated.
The implications of the speaking rate for the speech signal can be divided into three categories: timing effects, acoustic-phonetic effects, and phonological effects. Whereas HMM-based recognizers can handle different lengths of phonetic units without any problem, the speech-rate-specific changes in the spectral domain, as well as the use of variants differing from the standard pronunciation, generally lead to more confusions in the pattern recognition process. To capture these effects, the corresponding knowledge sources of the recognizer have to be adapted: the acoustic models, which represent the spectral characteristics of the phonetic units, and the pronunciation dictionary, which includes the pronunciations of the vocabulary of the recognition task. In comparison to speaker adaptation, this adaptation problem is more difficult to handle, as the speaking rate can change even within one sentence, whereas the speaker's identity remains constant.
Firstly, in an explicit adaptation strategy, a rule- and feature-based estimation of the speaking rate [98pfa1] was combined with switching between several speech-rate-specific recognition systems. A retraining of the acoustic models as well as changes in the pronunciation dictionary were performed to build the speech-rate-specific recognizers. Special emphasis was put on robust reestimation procedures for HMM parameters such as maximum a-posteriori (MAP) training [98pfa2]. This is a very important issue, as the speech-rate-specific material for parameter reestimation is generally very limited.
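The core of MAP reestimation for a Gaussian mean can be sketched as an interpolation between the prior mean and the sample mean of the (scarce) adaptation data. The prior weight tau below is a hypothetical value, not the one used in the cited work.

```python
def map_mean(prior_mean, samples, tau=10.0):
    """MAP estimate of a Gaussian mean with a conjugate prior of weight tau."""
    n = len(samples)
    # With little data the prior dominates; with much data the sample mean wins.
    return (tau * prior_mean + sum(samples)) / (tau + n)

# With only 3 rate-specific samples, the estimate stays close to the prior mean 0.
print(map_mean(0.0, [1.0, 1.2, 0.8]))
# With 1000 samples it approaches the sample mean 1.0.
print(map_mean(0.0, [1.0] * 1000))
```

This robustness against small sample sizes is exactly why MAP training suits the limited speech-rate-specific material mentioned above.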
Secondly, different types of normalization procedures were carried out to reduce the variations which have to be captured by the parameters of the HMMs. It proved useful both to reduce speech-rate-specific or speaker-specific variations [99pfa1] and to capture pronunciation variants [99pfa1] in order to increase robustness towards speaking rate. Finally, it was also shown that the improvements of the different normalization procedures are almost cumulative [00pfa1].
In the past years there has been tremendous progress in the performance of speech recognition systems. Nevertheless, a variety of unsolved problems remains, such as speaker variability, speaking style (fast, slow), and noise.
It is well known that speaker-dependent systems (i.e. systems specifically trained for one speaker) outperform their speaker-independent counterparts. In recent years there has therefore been a turnaround in the design of speech recognizers: most of them now incorporate some form of speaker dependency.
Especially for closed-speaker systems, i.e. systems with a limited number of users, a combination of speaker identification and speaker adaptation is very effective. The identification part allows the system to determine which speaker is currently using the recognizer, enabling the system to successively collect enrolment data for this particular speaker. In this way the system can be adapted step by step to the individual users, asymptotically reaching the performance of a speaker-dependent system.
The key part of an identification system is the set of speaker models. Commonly, either Hidden Markov Models or some kind of vector-quantized (VQ) codebook is used. A main advantage of the latter is that they do not rely on any phonetic segmentation of the speech signal, which is particularly advantageous for on-line applications: these models can be trained on whole utterances without the need for a phonetic segmentation. Such codebooks are usually built of K centroids, which can be trained using clustering techniques or sophisticated discriminative training algorithms.
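The codebook approach can be sketched as follows. This is a deliberately simplified illustration with scalar "features" and plain k-means, not the institute's implementation; in practice the features would be multi-dimensional spectral vectors:

```python
# Hedged sketch of VQ-codebook speaker models: each speaker is a set
# of K centroids trained on that speaker's features; identification
# picks the codebook with the lowest average quantization distortion.
# No phonetic segmentation of the utterance is needed.

def train_codebook(features, k=2, iters=10):
    # simple k-means on scalar features
    centroids = features[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in features:
            j = min(range(k), key=lambda c: abs(x - centroids[c]))
            clusters[j].append(x)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def distortion(features, codebook):
    return sum(min(abs(x - c) for c in codebook) for x in features) / len(features)

def identify(features, codebooks):
    # codebooks: dict speaker name -> centroid list
    return min(codebooks, key=lambda s: distortion(features, codebooks[s]))

books = {"A": train_codebook([0.0, 0.1, 1.0, 1.1]),
         "B": train_codebook([4.0, 4.1, 5.0, 5.1])}
print(identify([0.9, 1.2, 0.2], books))  # -> A
```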
With an increasing number of speakers, the computational load of the search process grows excessive, since all models have to be evaluated in parallel. In this case, efficient pruning strategies for discarding improbable speakers have to be applied. Another effective strategy is the application of tree-based speaker clusters, which can be pre-computed on the training data.
By collecting more and more enrolment data, the actual Hidden Markov Models can be adapted towards the speakers and their mode of speaking. The adaptation itself can be performed using training algorithms such as MLLR (Maximum Likelihood Linear Regression) in case the enrolment data is still sparse. If more data is available, Bayesian techniques like MAP (Maximum A Posteriori) or discriminative algorithms like MCE (Minimum Classification Error) or MMI (Maximum Mutual Information) can effectively be applied.
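The appeal of MLLR with sparse enrolment data is that one shared affine transform adapts all Gaussian means at once, including those for which no adaptation data was observed. A minimal scalar illustration (made-up numbers, ordinary least squares standing in for the full maximum-likelihood estimation of a matrix transform):

```python
# Hedged illustration of the MLLR idea: a single affine transform
# (here scalar: mu' = a*mu + b) is estimated from limited enrolment
# data and then applied to ALL Gaussian means, so even Gaussians
# without any adaptation data are moved towards the speaker.

def fit_affine(model_means, observed_means):
    # ordinary least squares for a, b
    n = len(model_means)
    mx = sum(model_means) / n
    my = sum(observed_means) / n
    sxx = sum((x - mx) ** 2 for x in model_means)
    sxy = sum((x - mx) * (y - my) for x, y in zip(model_means, observed_means))
    a = sxy / sxx
    b = my - a * mx
    return a, b

a, b = fit_affine([0.0, 1.0, 2.0], [0.5, 1.5, 2.5])  # data shows a pure shift
all_means = [0.0, 1.0, 2.0, 3.0]                     # mean 3.0 had no data
print([a * m + b for m in all_means])                # -> [0.5, 1.5, 2.5, 3.5]
```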
Speech is considered the most common and therefore most natural communication medium between human beings and, when used as an input medium, offers advantages such as hands-free operation and the ability to work in the dark. State-of-the-art speech recognizers in general allow only single-word commands or phrases containing a certain keyword. The research work carried out at the Institute for Human-Machine Communication led to a different approach which encourages the user to speak spontaneously in a most natural manner without the requirement of any learning process [P-L4, P-L5, 97mue, 98mue3, 97sta]. The only restriction of our system is the limitation to a single domain and a single language at a time. An average rate of 90% for correctly recognizing the user's intention could be reached in different domains [98mue1].
The core of the system is realized in a top-down architecture consisting of a one-stage maximum a posteriori semantic decoder and a signal preprocessor. The stochastic semantic decoder utilizes pre-trained probabilistic knowledge on the semantic, syntactic, phonetic and acoustic levels. An integrated chart parser makes use of the Viterbi algorithm, calculating semantic, syntactic, and acoustic probabilities on the basis of Hidden Markov Models (HMMs) and similar network structures. Its output, a semantic structure, is the input for a rule-based intention decoder which is placed "on top" of the system core. This decoder communicates bi-directionally with an external application and allows for online control, provided that the application offers a well-defined command interface. One of the basic advantages of such an approach is the portability to different domains. Front-ends for several applications have already been successfully realized following this principle, e.g. for a graphics editor, a service robot, medical image visualization, scheduling dialogues and a mathematical formula editor. It will also be applied in the ongoing projects "speech in the automotive environment" (cf. 4.1.2) and "navigation in VRML worlds" (cf. 4.1.1).
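The dynamic-programming core of such decoding, the Viterbi algorithm, can be sketched in a few lines. The toy states, observations and probabilities below are purely illustrative and not taken from the semantic decoder:

```python
# Hedged sketch of the Viterbi algorithm: for each time step, keep
# the best-scoring path into every state; the final answer is the
# single most probable state sequence.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for an observation list."""
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            p, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s][o],
                 V[-1][prev][1])
                for prev in states)
            layer[s] = (p, path + [s])
        V.append(layer)
    prob, path = max(V[-1].values())
    return path

states = ("quiet", "loud")
path = viterbi(["low", "low", "high"], states,
               {"quiet": 0.6, "loud": 0.4},
               {"quiet": {"quiet": 0.7, "loud": 0.3},
                "loud": {"quiet": 0.4, "loud": 0.6}},
               {"quiet": {"low": 0.9, "high": 0.1},
                "loud": {"low": 0.2, "high": 0.8}})
print(path)  # -> ['quiet', 'quiet', 'loud']
```

In the real decoder the same recursion runs over semantic, syntactic and acoustic probabilities simultaneously; the sketch only shows the shared algorithmic skeleton.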
Even the difficult task of automatic translation was successfully addressed with this approach [99mue1]. The speech understanding system plus a language production module are capable of translating German phrases of a special domain into English, French or other languages. This module contains a word chain generator with the syntactic model and a linguistic postprocessor including the grammar rules and the inflection model of the target language. Another add-on is an automatic HMM-based speech detection module. It allows the user to speak at any time without pushing a button to indicate the beginning or end of dialogue instances.
Future research will deal with unknown words, confidence measures and the idea of integrating the intention decoder into the semantic decoder. The latter idea promises higher recognition performance due to the more abstract and less ambiguous nature of the intention compared with the semantic structure. We also aim to exploit the detailed contextual knowledge obtained from the current dialogue state to integrate constraints for even more robustness.
Noise and vibration in the passenger compartment of a vehicle contribute substantially to the overall impression of a car's quality and therefore have a great influence on the buying decision of a customer. Because the time to market is constantly decreasing while the pressure to save costs is increasing, developing engineers seek to define and optimize a vehicle's vibrational and acoustic comfort characteristics as early as possible - even before the first prototype has been built. Due to the advances in the field of numerical structural acoustics and in computer technology, the finite element models used for this type of calculation have continuously increased in size, nearly overcompensating the fast-growing performance of modern computers. Therefore, one major deficiency of the FE method is the very long computing time, ranging from several hours for a simple frequency response function up to several days for an optimization task, even on a state-of-the-art supercomputer.
This fact, together with the batch-oriented computing style and the unwieldy, non-intuitive user interfaces of commercial FE calculation programs, prevents a flexible and creative employment of FE methods and a detailed understanding of the structural-acoustic coupling phenomena.
Using a new approach to the description of the vibro-acoustic equations of motion of model variants, the computing time for modifications and optimization of coupled fluid-structure systems could be drastically reduced. The so-called modal correction technique allows the entire optimization loop to be implemented using generalized coordinates for the description of the dynamic state of the system, thus reducing the problem size roughly by a factor of 1000. An additional decrease of problem size and computation time was achieved by applying an adaptive mode reduction technique during the solution stage of the modal equations and by streamlining the dataflow and workflow.
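The modal correction idea behind this speed-up can be sketched as follows (the notation is a generic illustration, not taken from the report): a structural modification of the full FE system is projected onto a small set of precomputed coupled modes, so that only small matrices change inside the optimization loop.

```latex
% Full coupled FE system with a design modification \Delta M, \Delta K
% (n degrees of freedom, typically very large):
(M + \Delta M)\,\ddot{x} + (K + \Delta K)\,x = f
% Modal approximation with m \ll n retained modes, x \approx \Phi q:
\Phi^{T} M \Phi\,\ddot{q} + \Phi^{T}\Delta M\,\Phi\,\ddot{q}
  + \Phi^{T} K \Phi\,q + \Phi^{T}\Delta K\,\Phi\,q = \Phi^{T} f
```

Because the projected matrices of the unmodified system are diagonal and computed once, each model variant only requires forming the small m-by-m correction terms, which accounts for a reduction of the problem size by orders of magnitude.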
These improvements make it possible to carry out vibro-acoustic calculations on a workstation with FE models which up to now could only be handled by supercomputers. As a consequence, it is now possible to run the FE calculations and optimizations in a quasi on-line, interactive way. This is still between 5 and 120 times faster than the conventional approach on a supercomputer.
To demonstrate the benefits of this new approach, a software package called VAO (Vibro-Acoustic Optimization) was developed, which offers a full palette of tools for the visualization, investigation, modification and optimization of coupled structure-fluid systems.
In most countries, the level of vehicle exterior noise is regulated by legislative limits. These restrictions have been tightened over the last two decades, and further reductions of the exterior noise level will be ratified. This trend requires considerably more acoustic optimization of cars. On the other hand, development periods in the automotive industry are decreasing, and a modification in the pre-production stage is hardly possible. To meet this challenge, the department is developing a new method for analyzing exterior noise in cooperation with the BMW Group Munich. The method provides an easy way to calculate the exterior noise of a vehicle at an early development stage.
The ISO R 362 regulation specifies the required measuring procedure for vehicle exterior noise. The car has to accelerate on a 20-meter track, while a microphone measures the noise 7.5 meters beside the lane. For this pass-by test procedure, the maximum noise level is restricted to 74 dB(A). For acoustical analysis and development, the BMW Group set up a pass-by test chamber in which the pass-by test can be simulated. In principle, it is the same situation as in ISO R 362, but with a standing car and a moving microphone. All necessary measurements are performed in this test chamber.
The exterior noise is generated by several sources, such as the muffler, orifice, and engine. In order to reduce the exterior noise level effectively, the noisiest source has to be attenuated first; it is therefore necessary to know the noise ranking of all components. Normally, a lot of effort has to be spent to determine this ranking. The new method offers a quick and elegant solution for this task. It is divided into three main steps:
The noise sources are replaced by loudspeakers (see fig. 10). Each loudspeaker is driven with white noise, and the transfer function from each speaker to every microphone is measured. The loudspeaker parameter for the transfer functions is the membrane acceleration aj.
The principal idea is to replicate the operational sound field by loudspeakers. In order to synthesize the measured microphone sound pressure (step 1), the necessary speaker adjustments have to be calculated. The sound pressure synthesis is performed only virtually. Therefore, in step two the transfer function of each loudspeaker is measured, and in step three all loudspeakers are adjusted and activated simultaneously. In this way, each speaker sounds like the component it has replaced.
With this result, the noise ranking of all sources during the pass-by test can be determined. Furthermore, the calculation of the pass-by noise of a car with modified sources is possible: the only adjustments required are to the loudspeaker signals, according to the modified components.
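Calculating the speaker adjustments amounts to solving a small linear system: the measured transfer functions relate the speaker drive levels to the microphone pressures, and the drive levels are chosen to reproduce the operational pressures. A toy real-valued sketch (the actual problem is complex-valued and solved per frequency; all numbers are made up):

```python
# Hedged sketch of the speaker-adjustment step: H[i][j] is the
# measured transfer function from speaker j (membrane acceleration)
# to microphone i (sound pressure), p[i] the operational pressures
# from step 1.  Solving H a = p gives the drive levels a.  A tiny
# 2x2 Gaussian-elimination solve stands in for the real computation.

def solve2(H, p):
    (a, b), (c, d) = H
    det = a * d - b * c
    return [(p[0] * d - b * p[1]) / det, (a * p[1] - p[0] * c) / det]

H = [[1.0, 0.2],   # mic 1 hears mostly speaker 1
     [0.3, 1.0]]   # mic 2 hears mostly speaker 2
p = [1.2, 1.3]     # pressures measured during the operational pass-by
a = solve2(H, p)
# check: driving the speakers with a reproduces the measured field
print([sum(H[i][j] * a[j] for j in range(2)) for i in range(2)])  # ~[1.2, 1.3]
```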
 Freymann, R.; Stryczek, R.; Riess, M. and Demmerer, S.: A new CAT-Technique for the Analysis and Optimization of Vehicle Exterior Noise Characteristics. IMechE Conf. Trans. "Vehicle Noise and Vibration 2000", London 2000.
The concept that physical noise evaluation has to be based on features of the human hearing system was further advanced and described in detail in invited review papers [96fas2, 97fas1, 97fas3, 97fas4, 99fas1, 00cha1]. The description of noise emissions by loudness as standardized in DIN 45 631 is nowadays commonplace in most acoustic labs worldwide.
On the other hand, a firm psychoacoustic basis had to be established with respect to the physical measurement of noise immissions. Numerous studies were performed on industrial noise [97ste2] and traffic noise [98got2, 99got1, 00kuw1]. In particular, for the first time, the "railway bonus" could be confirmed also in laboratory studies [98fas1, 00kuw1]. In addition, the noise immission from leisure noise was studied for the example of tennis noise [98ste1, 99fil1].
As a global result it turned out that the percentile loudness N5, i.e. the loudness which is reached or exceeded in 5% of the measurement time, is a good indicator for the impact of noise immissions [00fas10]. Even the effects of "railway bonus" and "aircraft malus" can be predicted on the basis of N5 [00fas7].
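Following this definition, N5 is simply the 95th percentile of the loudness-time function. A minimal sketch (the sone values are illustrative, not measured data):

```python
# Hedged sketch: percentile loudness N5 is the loudness value reached
# or exceeded in 5% of the measurement time, i.e. the 95th percentile
# of the sampled loudness-time function.

def percentile_loudness(samples, percent_exceeded=5.0):
    ordered = sorted(samples, reverse=True)
    # number of samples allowed to reach or exceed the returned value
    k = max(1, int(round(len(ordered) * percent_exceeded / 100.0)))
    return ordered[k - 1]

# mostly 10 sone, but 20 sone during 5% of the measurement time
loudness = [10.0] * 95 + [20.0] * 5
print(percentile_loudness(loudness))  # -> 20.0
```

The example shows why N5 tracks the impact of immissions better than a plain average: the brief loud events that dominate annoyance dominate N5 as well.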
Loudness constitutes a dominant feature in the rating of sound quality; however, in particular for sounds of similar loudness, other hearing sensations like sharpness, fluctuation strength, or roughness may play an important role [00fas11].
For physical evaluation of sounds, new metrics were proposed which are described in an overview paper [98fas4]. In particular with respect to temporal aspects, signal processing algorithms in loudness analysis systems have to mimic in great detail features of the human hearing system [98wid1, 98wid2, 98wid3].
With respect to the loudness of stationary sounds, data of the different analysis systems available on the market are in good agreement, with deviations of less than 5 % [97fas2]. In contrast, with respect to temporal processing, huge differences may occur between instruments of different manufacturers, depending on the degree of sophistication of the algorithms implemented [98fas3].
As concerns the physical measurement of noise immissions, statistical procedures were developed and implemented which predict the accuracy of measurement achievable in a given measurement time [97ste4].
A new, updated and extended edition of the book "Psychoacoustics - Facts and Models" was published. Some older material was re-arranged, text was adapted to current terminology, and new results were added, in particular in the chapters on pitch, fluctuation strength, roughness, and practical application (Zwicker and Fastl [99zwi1]).
The hearing sensation "pitch strength" was studied in great detail and the results are compiled in an overview paper (Schmid [99sch1]). Among other things it could be shown that modulations enhance the pitch strength of low pass noise [98sch3], but reduce the pitch strength of pure tones [97sch3]. For complex tones, the interaction of "pointing tones" and pitch strength was studied in detail [97cha, 98sch1, 98sch2]. Correlations between pitch strength and frequency discrimination [98fas6] as well as effects of vibrato on the pitch strength were established [98hut1].
With respect to the Zwicker-Tone, it could be demonstrated that combinations of a pure tone plus low-pass noise, a pure tone plus band-pass noise, as well as a pure tone plus band-stop noise can also produce a Zwicker-Tone [00fas11]. Current psychoacoustic models of the Zwicker-Tone could be confirmed. In addition, neurologically based models of the Zwicker-Tone nicely account for the psychoacoustical facts (paper in preparation).
For the specification of the sound quality inside future high speed trains, tonal components produced by the motors or by corrugated rails can play an important role. Therefore, in psychoacoustic experiments, the dominance of tonal components at 630 Hz or 1250 Hz was assessed [99hut1]. For an increase of the corresponding 1/3-octave band by 20 dB, a clear tonal character is audible, which is only half as pronounced for an increase of 12.5 dB at 630 Hz or 10 dB at 1250 Hz. In line with expectation, no tonal quality is perceived if the 1/3-octave band in question is not enhanced. However, a decrease of sound energy in a 1/3-octave band by 20 dB can also produce a faint tonal sensation with a magnitude of about 1/10 of the tonal sensation produced by an increase of 20 dB. The results obtained with stimuli simulating the sound quality inside high speed trains are in good agreement with data from basic psychoacoustic experiments. Therefore, it is expected that sound quality evaluation of high speed train indoor noise can profit from the multitude of psychoacoustic data available.
Another aspect investigated is the disturbance of privacy in high speed trains, which has become more evident in recent years due to successfully performed noise reduction measures [00pat1]. In psychoacoustic investigations, the contradictory requirements of unwanted speech intelligibility disturbing privacy on the one hand, and the desired sound quality on the other hand, were assessed. Early results show that, in order to ensure privacy across the tiers in large cabins of high speed trains and at the same time not reduce sound quality much, intensive shielding measures would be necessary. By means of basic psychoacoustic tools, quantitative results are obtained from which a reasonable cost-benefit calculation could be established.
The method of "line length" which has proven very successful for the evaluation of noise immissions was adapted for its use in audiology [98got1]. In comparison to the presently used categorical scaling [97bau], advantages for clinical applications were verified.
For the Ukrainian language, a speech test was developed, realized and recorded on CD [98cha1]. Moreover, the intelligibility of monosyllables in background noise was tested for the languages German, Hungarian, and Slovene [97ste1].
For patients with cochlear implants, the ability to understand speech was tested both in quiet and in background noise [98fas2]. While in quiet surroundings speech perception of cochlear implant patients can be restored nearly to normal, in noisy environments they experience extreme problems and, in comparison to normal-hearing persons, need a signal-to-noise ratio that is better by more than 15 dB [98fas1].
Nowadays about 12 % of the population in Germany suffer from hearing disorders. In order to develop signal processing algorithms and fitting procedures for hearing aids, detailed knowledge about perceptual consequences of hearing impairment is very important. However, this knowledge still is far from being complete.
Therefore, based on results of psychoacoustic experiments, Zwicker's model of loudness [99zwi1] was modified so that the loudness perceived by hearing-impaired listeners can also be predicted. This is achieved by fitting the loudness function to a specific hearing loss. Since the specific loudness-time pattern can be regarded as an aurally adequate representation of sound, various other psychoacoustic hearing sensations like "fluctuation strength" can be modeled on the basis of this pattern. Results from hearing experiments concerning "loudness fluctuation" - which is nearly the same as "fluctuation strength" - can be accounted for by the same model for normal and hearing-impaired listeners, if loudness fluctuation is calculated from the modified specific loudness-time pattern [99cha1, 00cha2].
From the specific loudness functions of normal and hearing impaired listeners, input-output functions for a signal processing system, which simulates hearing impairment, are deduced [00cha1]. Time-frequency analysis and synthesis within this system is done by the Fourier Time Transformation and its inverse [1, 98mum1]. Simulation of hearing impairment is helpful in evaluating hearing aid algorithms and fitting procedures for certain hearing losses as well as for providing normal hearing listeners with realistic demonstrations of auditory consequences of hearing impairment.
Moreover, in order to improve speech intelligibility in noise for hearing impaired listeners, the influence of so called "psychoacoustic processors" on speech intelligibility was assessed [00cha3] and - in cooperation with "Gasteig München" - magnetic induction loops, which are commonly used for speech transmission in churches and concert halls, were optimized [00cha4].
The main subject of research in the field of Acoustical Communication is the relation between the properties of acoustical signals on one hand, and the sensory analysis techniques of the human ear on the other hand. The book "Akustische Kommunikation", published by Professor Ernst Terhardt in 1998, treats almost every aspect of this interdisciplinary field of research and includes an audio-CD with demonstrations of various phenomena of auditory perception [98ter1].
In the four-year period of this report, research focused on the foundations of ear-adapted spectral representations, their relation to hearing perception, and their technical application. The following paragraphs give an overview of the research topics.
A system of linear filters was designed for modelling auditory preprocessing of sound [97ter]. It consists of a filter that accounts for the first two resonances of the ear-canal, and a filterbank that accounts for the spectral analysis of the inner ear. The design of the cochlear filter, which forms an element of this filterbank, takes advantage of knowledge about properties of the ear, in particular the threshold of hearing and the characteristics of tuning curves. Since the signal processing of the system allows easy digital computation, it is especially suited to be used as a front-end for models of more complex features of auditory perception.
Speech coding algorithms based on contours of auditory spectrograms, which are computed with an advancement of the Fourier Time Transformation (FTT) [1, 98mum1], were developed and investigated. These contours are defined as "ridges" in the 3D representation of the magnitude spectrogram. Contours bearing relevant information correspond, among others, to part tones perceptible by the ear. Starting off from a known representation, additional ridges ("time contours") and a new signal reconstruction procedure were introduced. A classification of ridges allows tonal signal components to be separated from noise-like ones. Applying these foundations to data reduction algorithms, speech codecs with data rates down to 4 kbps were presented and evaluated.
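The ridge extraction underlying these contours can be sketched in its simplest form: in each spectral frame, local maxima of the magnitude spectrum are taken as ridge points, and linking them over time yields the part-tone contours. This toy sketch (made-up spectrogram values, not the FTT-based procedure) only illustrates the principle:

```python
# Hedged sketch of contour ("ridge") extraction: per frame, find
# local maxima of the magnitude spectrum; tracked over time these
# form the part-tone contours used for coding.

def ridge_points(frame, threshold=0.0):
    """Indices of local maxima in one magnitude spectrum frame."""
    return [i for i in range(1, len(frame) - 1)
            if frame[i] > frame[i - 1]
            and frame[i] >= frame[i + 1]
            and frame[i] > threshold]

spectrogram = [  # rows = time frames, columns = frequency bins
    [0.1, 0.9, 0.2, 0.1, 0.7, 0.1],
    [0.1, 0.8, 0.3, 0.1, 0.8, 0.2],
]
contours = [ridge_points(frame) for frame in spectrogram]
print(contours)  # -> [[1, 4], [1, 4]]
```

Two stable ridges (bins 1 and 4) persist across both frames, corresponding to two part tones; noise-like components would produce unstable, short-lived ridge points instead.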
Models and algorithms were developed to extract those sound signal parameters that are characteristic for the specific sound of a piano tone and its quality [98val1, 98val2, 00val1]. Just as the speech coding application mentioned before, the algorithms rely on the significance of contours of FTT spectrograms. Listening experiments were carried out to investigate the perceived dissimilarity and the quality of the sound of different piano tones. The results were utilized to design a model for the sound quality of piano tones and to validate its predictions. The new methods for the measurement of the discrimination criteria can serve to enhance the quality of electronic and acoustic pianos and could be used for automatic quality control in musical instrument manufacturing.
A system for the determination of the virtual and spectral pitches of non-stationary sound signals was developed [98rue1, 00rue1, 00rue2]. It accounts for those essential properties of the human ear that can be observed in psychoacoustical experiments concerning pitch perception. Besides the elementary feature of frequency selectivity, spectro-temporal contrast effects in particular are worth mentioning in this regard; these effects are not addressed by previously known systems. The developed system is able to model both the existence and the prominence of perceived pitches in time-variant sound signals. This system, too, uses contours of an auditory FTT spectrogram as a basis.
A method to resynthesize CD-quality sounds from their auditory FTT spectrogram images was found, suggesting sophisticated sound modification via image processing [98hor1]. In particular, resynthesis of specifically selected spectrogram areas has confirmed the validity of strong audio-visual gestalt analogies.
 see section 4.7.4.
© Lehrstuhl für Mensch-Maschine-Kommunikation, Feb. 2001