Gesture and Image Sequence Recognition

The goal of this research project is the recognition of gestures in image sequences. The recognition system is capable of recognizing 24 different gestures.
The database consists of 336 image sequences containing gestures of 12 different people. For training and testing the system we recorded video sequences of 24 different gestures. The resolution of the video sequences was 96 x 72 gray-scale pixels at 16 frames per second. Each sequence consists of 50 frames, resulting in a sequence length of approximately three seconds.
Tab. 1 shows the different gestures.

Hand-Waving-Both Hand-Waving-Right Hand-Waving-Left To-Right
To-Left To-Top To-Bottom Round-Clockwise
Round-Counterclockwise Stop Come Nod-Yes
Nod-No Clapping Kotow Spin
Go-Left Go-Right Turn-Right Turn-Left

Table 1: Gestures

The gesture recognition system contains the following processing levels:


Preprocessing

The preprocessing prepares the image sequence for recognition by calculating the difference image sequence. Each difference image is computed by subtracting the pixel values at the same position in adjacent frames of the original image sequence. The background and static parts of the body are thereby eliminated, so the recognition system works independently of person and background.
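The difference-image computation described above can be sketched as follows (a minimal NumPy sketch; the function name and array layout are illustrative assumptions, not from the original system):

```python
import numpy as np

def difference_images(frames):
    # frames: gray-scale sequence of shape (T, H, W)
    frames = frames.astype(np.int16)   # signed type so negative changes survive
    return frames[1:] - frames[:-1]    # subtract adjacent frames; static background cancels
```

Casting to a signed type before subtracting matters: with unsigned 8-bit pixels, a darkening region would wrap around instead of producing a negative difference.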


An easy way to reduce noise in the difference images is to apply a threshold operation: every pixel with an absolute value smaller than the threshold is set to zero.
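The threshold operation can be sketched like this (the threshold value 10 is an illustrative choice; the text does not state the value used):

```python
import numpy as np

def threshold_denoise(diff, threshold=10):
    # zero out small differences, which are mostly camera noise
    out = diff.copy()
    out[np.abs(out) < threshold] = 0
    return out
```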


Fig. 1 shows the effects of the preprocessing on an example sequence.

Figure 1: Original sequence, difference image sequence, and
difference image sequence after thresholding

Feature Extraction

The feature extraction extracts the information necessary for identifying the gesture. For each image of the sequence a seven-dimensional vector is calculated. This results in a vector sequence in which each vector carries important information about the current motion. Fig. 2 shows an original image sequence and a difference image sequence with a graph of the feature vector overlaid. The center of the ellipse is the center of gravity, the main axes are the deviations from the center of gravity, and the gray value of the ellipse represents the intensity of motion.
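The quantities visualized by the ellipse can be computed from one absolute difference image roughly as below. This is a sketch under assumptions: the text names the center of gravity, the deviations, and the motion intensity but does not list all seven features, so only those described quantities are shown and the exact definitions used in the system may differ.

```python
import numpy as np

def motion_features(diff_img):
    w = np.abs(diff_img).astype(np.float64)    # motion weight per pixel
    total = w.sum()
    if total == 0:                             # no motion between these frames
        return np.zeros(5)
    ys, xs = np.indices(w.shape)
    cx = (xs * w).sum() / total                # center of gravity, x
    cy = (ys * w).sum() / total                # center of gravity, y
    sx = np.sqrt((((xs - cx) ** 2) * w).sum() / total)  # deviation along x
    sy = np.sqrt((((ys - cy) ** 2) * w).sum() / total)  # deviation along y
    intensity = total / w.size                 # mean motion intensity
    return np.array([cx, cy, sx, sy, intensity])
```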

Figure 2: Original image sequence and difference sequence with
overlaid ellipse

Statistical Classification

The recognition of gestures in video sequences can be interpreted as a dynamic pattern recognition problem. Hidden Markov Models (HMMs), well known from speech recognition, offer superior pattern recognition capabilities for the dynamic case. Further advantages of HMMs are that segmentation and recognition take place simultaneously, and that HMMs can be trained from a number of training samples, similar to neural networks.
The system presented here uses 24 different HMMs, one for each of the 24 gestures. The HMM parameters were estimated from the feature sequences of the corresponding gesture samples using the Forward-Backward (Baum-Welch) algorithm.
With respect to the different types of gestures, such as periodic and linear gestures, different HMM topologies are used. Linear gestures (gestures 17-24), like "go right", are modeled with a linear topology (Fig. 3a), where each state has only a self-transition and a transition to the following state. Cyclic, or periodic, gestures (gestures 1-16), like "round clockwise" or "pointing to top", which may appear more than once in a sequence, are modeled with a cyclic topology, as shown in Fig. 3b.
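The two transition structures can be sketched as initial transition matrices (the probability value 0.6 is an illustrative starting point; in practice these entries are re-estimated during training):

```python
import numpy as np

def linear_topology(n_states, p_stay=0.6):
    # left-to-right: each state allows only a self-loop and a step forward
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = p_stay
        A[i, i + 1] = 1.0 - p_stay
    A[-1, -1] = 1.0                    # final state absorbs
    return A

def cyclic_topology(n_states, p_stay=0.6):
    # like linear, but the last state loops back to the first,
    # so a periodic gesture can repeat within one sequence
    A = linear_topology(n_states, p_stay)
    A[-1, -1] = p_stay
    A[-1, 0] = 1.0 - p_stay
    return A
```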

Figure 3: (a) Linear and (b) cyclic model


To get the maximum amount of training and test data we used the leave-one-person-out method: all sequences of one person were removed from the complete set of 336 sequences, and the recognition system was trained with the remaining 312 samples. For the test we used the 24 sequences that had been removed from the complete set. This process was repeated for every person, and finally the average recognition rate over all people was calculated. The overall recognition rate for this task is 92.9%.
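The evaluation loop described above can be sketched as follows; `train` and `classify` are hypothetical placeholders standing in for HMM training and maximum-likelihood scoring against the 24 gesture models:

```python
def leave_one_person_out(samples, train, classify):
    # samples: list of (person_id, gesture_label, feature_sequence)
    people = sorted({p for p, _, _ in samples})
    correct = total = 0
    for held_out in people:
        train_set = [s for s in samples if s[0] != held_out]
        test_set  = [s for s in samples if s[0] == held_out]
        model = train(train_set)                 # train on the other persons
        for _, label, feats in test_set:
            correct += (classify(model, feats) == label)
            total += 1
    return correct / total                       # average over all persons
```

Because every person is held out exactly once, each sequence is used for testing once and for training in all other folds, which is what makes this scheme person-independent.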


The following people from the Institute of Computer Science at the University of Duisburg contributed to this work: Andreas Kosmala, Stefan Eickeler (diploma thesis), Arno Römer (student research project), Holger Grütjen (student research project) and Abdelkader Mechrouki (diploma thesis).