Multi-View Region of Interest Prediction for Autonomous Driving

Visual environment perception is one of the key elements of autonomous and manual driving. Modern fully automated vehicles are equipped with a range of different sensors and capture their surroundings with multiple cameras. The ability to predict a human driver's attention is the basis for various autonomous driving functions. State-of-the-art attention prediction approaches use only a single front-facing camera and rely on automatically generated training data. In this paper, we present a manually labeled multi-view region of interest dataset. We use our dataset to fine-tune a state-of-the-art region of interest prediction model for multiple camera views. Additionally, we show that using two separate models, one for front view and one for rear view data, improves the region of interest prediction. We further propose a semi-supervised annotation framework that uses the best performing fine-tuned models to generate pseudo labels, improving the efficiency of the labeling process. Our results show that existing region of interest prediction models perform well on front view data, but that fine-tuning improves the performance, especially for rear view data. Our current dataset consists of about 16,000 images, and we plan to further increase its size. The dataset and the source code of the proposed semi-supervised annotation framework will be made available on GitHub and can be used to generate custom region of interest data.

Link to the dataset at mediaTUM
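
The semi-supervised annotation loop described above can be summarized as follows: a fine-tuned model predicts region of interest maps for unlabeled images, confident predictions are kept as pseudo labels, and only uncertain images are forwarded to a human annotator. The sketch below illustrates this idea; the model interface, the confidence measure, and the 0.8 threshold are illustrative assumptions, not the published implementation.

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off, not taken from the paper

def generate_pseudo_labels(model, unlabeled_images):
    """Split unlabeled images into auto-accepted pseudo labels and
    images that still need manual annotation (illustrative sketch)."""
    pseudo_labels, needs_review = {}, []
    for name, image in unlabeled_images.items():
        roi_map = model.predict(image)  # per-pixel ROI probabilities in [0, 1]
        # Use the mean distance from the 0.5 decision boundary as a confidence proxy.
        confidence = float(np.mean(np.abs(roi_map - 0.5)) * 2.0)
        if confidence >= CONFIDENCE_THRESHOLD:
            pseudo_labels[name] = (roi_map > 0.5).astype(np.uint8)
        else:
            needs_review.append(name)  # forward to a human annotator
    return pseudo_labels, needs_review
```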

Room segmentation in point clouds

Emerging applications, such as indoor navigation or facility management, create new requirements for the automatic and robust partitioning of indoor 3D point clouds into rooms. Previous research is either based on the Manhattan-world assumption or relies on the availability of scanner pose information. We address these limitations by following the architectural definition of a room, in which a room is an inner free space separated from other spaces by openings or partitions. To this end, we formulate an anisotropic potential field for 3D environments and illustrate how it can be used for room segmentation in the proposed segmentation pipeline. The experimental results confirm that our method outperforms state-of-the-art methods on a number of datasets, including those that violate the Manhattan-world assumption.
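
As a rough intuition for such a potential field, the free space of a voxelized point cloud can be assigned a distance-to-obstacle value in which vertical and horizontal steps are weighted differently, so that room interiors form pronounced basins separated by low-value necks at openings such as doorways. The following sketch only illustrates this idea with an anisotropically scaled Euclidean distance transform and a watershed over the resulting field; the actual formulation and segmentation pipeline of the work may differ, and the voxel size, scaling factor, and seed threshold are assumptions.

```python
import numpy as np
from scipy import ndimage
from skimage.segmentation import watershed

def anisotropic_potential(occupancy, voxel_size=0.05, z_scale=3.0):
    """Anisotropic distance-to-obstacle field over the free space.

    occupancy : 3D bool array (x, y, z), True where a voxel contains points.
    z_scale   : assumed factor making vertical steps count more than
                horizontal ones, which makes the field anisotropic.
    """
    free = ~occupancy
    # Euclidean distance transform with per-axis spacing (x, y, z);
    # occupied voxels get a distance of 0 by construction.
    field = ndimage.distance_transform_edt(
        free, sampling=(voxel_size, voxel_size, z_scale * voxel_size))
    return field

def segment_rooms(field, seed_threshold=1.5):
    """Grow room labels from high-potential basins of the field (sketch)."""
    seeds, num_rooms = ndimage.label(field > seed_threshold)
    # Watershed on the inverted field assigns every free voxel to a basin.
    labels = watershed(-field, markers=seeds, mask=field > 0)
    return labels, num_rooms
```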

LMT Texture Database

When a rigid tool is stroked over an object surface, vibrations that represent the interaction between the tool and the surface texture are induced in the tool and can be measured by means of an accelerometer. Such acceleration signals can be used to recognize or classify object surface textures. The temporal and spectral properties of the acquired signals, however, depend heavily on parameters such as the force applied to the surface or the lateral velocity during exploration. Robust features that are invariant to such scan-time parameters are currently lacking, but would enable texture classification and recognition from uncontrolled human exploratory movements. We introduce a haptic texture database which allows for a systematic analysis of feature candidates. The database includes accelerations recorded during controlled and well-defined texture scans, as well as during uncontrolled free-hand human texture explorations, for 69 different textures.
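
A typical feature-candidate pipeline on such recordings computes spectral descriptors of the acceleration signal and evaluates how well they separate the textures under varying scan parameters. The sketch below shows one such baseline (log band powers of a Welch spectrum fed to an SVM); the sampling rate, band layout, and classifier choice are assumptions for illustration and are not part of the database itself.

```python
import numpy as np
from scipy.signal import welch
from sklearn.svm import SVC

FS = 10000  # assumed accelerometer sampling rate in Hz

def spectral_band_features(accel, n_bands=20, fmax=1000.0):
    """Summarize an acceleration trace by the log power in n_bands
    equally wide frequency bands up to fmax (illustrative features)."""
    freqs, psd = welch(accel, fs=FS, nperseg=1024)
    edges = np.linspace(0.0, fmax, n_bands + 1)
    feats = [np.log(psd[(freqs >= lo) & (freqs < hi)].mean() + 1e-12)
             for lo, hi in zip(edges[:-1], edges[1:])]
    return np.array(feats)

def train_texture_classifier(traces, labels):
    """Fit an SVM on spectral band features of labeled acceleration traces."""
    X = np.stack([spectral_band_features(t) for t in traces])
    clf = SVC(kernel="rbf", C=10.0)
    return clf.fit(X, labels)
```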

A dataset of thin-walled deformable objects

Datasets of object models with many variants of each object are required for manipulation and grasp planning using machine learning and simulation methods. This work presents a parametric model generator for thin-walled deformable or solid objects found in household scenes, such as bottles, glasses, and other containers. Two datasets are provided whose models resemble real objects and contain a large number of variants of realistic bottles.
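
Parametric container models of this kind are typically built as surfaces of revolution: a 2D radius profile controlled by a few parameters (body radius, neck radius, heights) is swept around the vertical axis and jittered to produce variants. The sketch below shows this general idea; the parameter names, shapes, and ranges are invented for illustration and do not reproduce the generator behind the datasets.

```python
import numpy as np

def bottle_profile(body_r=0.04, neck_r=0.012, body_h=0.18, neck_h=0.06, n=80):
    """Radius as a function of height for a simple bottle shape:
    cylindrical body, cosine-blended shoulder, cylindrical neck.
    All dimensions are in meters and purely illustrative."""
    z = np.linspace(0.0, body_h + neck_h, n)
    shoulder_start = 0.75 * body_h
    # Blend factor: 0 in the body, rising smoothly to 1 at the neck.
    t = np.clip((z - shoulder_start) / (body_h - shoulder_start), 0.0, 1.0)
    blend = 0.5 * (1.0 - np.cos(np.pi * t))
    r = body_r * (1.0 - blend) + neck_r * blend
    return z, r

def revolve(z, r, n_seg=48):
    """Sweep the profile around the z-axis to obtain surface vertices."""
    theta = np.linspace(0.0, 2.0 * np.pi, n_seg, endpoint=False)
    x = np.outer(r, np.cos(theta))
    y = np.outer(r, np.sin(theta))
    verts = np.stack([x, y, np.repeat(z[:, None], n_seg, axis=1)], axis=-1)
    return verts.reshape(-1, 3)

def random_variant(rng):
    """Draw one random bottle by jittering the profile parameters."""
    z, r = bottle_profile(body_r=rng.uniform(0.03, 0.05),
                          neck_r=rng.uniform(0.008, 0.016),
                          body_h=rng.uniform(0.15, 0.22),
                          neck_h=rng.uniform(0.04, 0.08))
    return revolve(z, r)
```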

Video Synchronization Benchmark

This website provides a collection of user-generated multi-viewpoint video sets, i.e., casual recordings of isolated events captured from multiple perspectives. Its purpose is to facilitate an objective performance evaluation of different video synchronization algorithms. The video collection covers 43 distinct events, each recorded from 2 to 5 viewpoints. In total, there are 164 video pairs whose relative temporal offsets are to be determined. All videos have been recorded with consumer-grade cameras (camcorders and mobile phones) under realistic conditions (shaking cameras, unconstrained viewpoints, etc.), which makes fully automatic synchronization a challenging task.
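
A simple baseline for estimating the relative temporal offset of such a video pair is to reduce each video to a 1D signal (for example, the mean frame brightness over time) and locate the peak of the normalized cross-correlation between the two signals. The sketch below illustrates this baseline; it assumes equal frame rates and is not one of the algorithms evaluated on the benchmark.

```python
import numpy as np

def frame_brightness(frames):
    """Reduce a video to a 1D signal: mean pixel intensity per frame.
    `frames` is an iterable of HxW (or HxWxC) numpy arrays."""
    return np.array([float(f.mean()) for f in frames])

def estimate_offset(signal_a, signal_b):
    """Estimate the lag k (in frames) that best aligns the two signals,
    i.e. the k maximizing the correlation of a[t + k] with b[t]."""
    a = (signal_a - signal_a.mean()) / (signal_a.std() + 1e-12)
    b = (signal_b - signal_b.mean()) / (signal_b.std() + 1e-12)
    corr = np.correlate(a, b, mode="full")
    # np.correlate in 'full' mode: output index j corresponds to lag j - (len(b) - 1).
    lag = int(np.argmax(corr)) - (len(b) - 1)
    return lag
```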