Marker-less 3D Human Motion Capture with Monocular Image Sequence and Height-Maps


The recovery of 3D human pose with a monocular camera is an inherently ill-posed problem, since many distinct 3D poses can project to the same 2D image. To improve the accuracy of 3D motion reconstruction, we introduce additional built-in knowledge, namely the height-map, into the algorithmic scheme for reconstructing 3D pose/motion under a single-view calibrated camera. The proposed framework makes two major contributions. Firstly, the RGB image and its computed height-map are combined to detect 2D joint landmarks with a dual-stream deep convolutional network. Secondly, we formulate a new objective function to estimate 3D motion from the detected 2D joints in the monocular image sequence, which enforces temporal coherence constraints on both the camera and the 3D poses. Experiments on the HumanEva, Human3.6M, and MCAD datasets show that our method outperforms state-of-the-art algorithms on both 2D joint localization and 3D motion recovery. Moreover, the evaluation results on HumanEva indicate that the performance of our single-view approach is comparable to that of its multi-view deep learning counterpart.

Results and code
ECCV16 Poster
ECCV16 Slides


We propose a novel framework for marker-less 3D human motion capture with a single-view calibrated camera, where the 3D human pose is represented as a skeleton model parameterized by joint locations. The framework consists of three key components, namely height-map generation, 2D joint localization, and 3D motion estimation.
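To make the third component concrete, the sketch below illustrates the kind of objective the paper describes: a reprojection term on the detected 2D joints plus a temporal-coherence penalty across frames. This is a simplified, hypothetical version for illustration only -- it assumes a scaled-orthographic projection and fixed camera, whereas the paper's actual objective uses the calibrated perspective camera and also constrains the camera over time.

```python
import numpy as np

def project(X):
    """Scaled-orthographic projection: keep the x, y coordinates.
    (Simplification; the paper uses a calibrated perspective camera.)"""
    return X[..., :2]

def motion_objective(X, joints2d, lam=0.1):
    """Reprojection error plus a temporal-coherence penalty.
    X: (T, J, 3) 3D joint trajectories; joints2d: (T, J, 2) detections."""
    reproj = np.sum((project(X) - joints2d) ** 2)
    temporal = np.sum((X[1:] - X[:-1]) ** 2)   # penalize frame-to-frame jumps
    return reproj + lam * temporal

def refine(X0, joints2d, lam=0.1, lr=0.05, iters=500):
    """Gradient descent on the quadratic objective above."""
    X = X0.copy()
    for _ in range(iters):
        g = np.zeros_like(X)
        g[..., :2] += 2.0 * (project(X) - joints2d)   # reprojection gradient
        diff = X[1:] - X[:-1]
        g[1:] += 2.0 * lam * diff                     # temporal gradient
        g[:-1] -= 2.0 * lam * diff
        X -= lr * g
    return X
```

Because both terms are quadratic, plain gradient descent converges; the temporal weight `lam` trades off smoothness of the recovered motion against fidelity to the per-frame 2D detections.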

The height-map is generated by an existing height estimation algorithm [33] using the calibrated camera parameters and the body silhouette.
(a) Illustration of height-map generation with a pre-calibrated monocular camera. (b) Anatomical decomposition of the skeleton based on height [35].
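The details of the height estimation algorithm [33] are not reproduced here, but the underlying single-view geometry can be sketched as follows: with a calibrated camera (intrinsics K, rotation R, translation t) and a ground plane z = 0, the foot pixel of a silhouette column is back-projected onto the ground to locate the person, and any pixel above it is assigned the height of its back-projected 3D point. The approximation below, that the pixel lies at the same camera depth as the foot, is an assumption of this sketch, not necessarily the method of [33].

```python
import numpy as np

def backproject_ray(pixel, K, R, t):
    """World-space ray (origin, direction) through an image pixel."""
    d_cam = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    origin = -R.T @ t          # camera center in world coordinates
    direction = R.T @ d_cam
    return origin, direction

def pixel_height(foot_px, head_px, K, R, t):
    """Height above the ground plane z = 0 for a silhouette pixel,
    assuming it lies at the same camera depth as the foot pixel."""
    c, d = backproject_ray(foot_px, K, R, t)
    s = -c[2] / d[2]           # intersect the foot ray with z = 0
    foot_w = c + s * d
    depth = (R @ foot_w + t)[2]               # camera-space depth of the foot
    head_cam = depth * (np.linalg.inv(K) @ np.array([head_px[0], head_px[1], 1.0]))
    head_w = R.T @ (head_cam - t)             # unproject at the foot's depth
    return head_w[2]                          # world z = height above ground
```

Applying `pixel_height` to every silhouette pixel of a column (with the column's foot pixel as reference) yields the per-pixel height-map that is fed, alongside the RGB image, into the dual-stream network.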

Dataset Samples

We evaluate our approach on four datasets:
(1) the synthetic height-maps dataset, (2) HumanEva dataset [46], (3) Human3.6M dataset [45], and (4) Multi-Camera Action Dataset (MCAD) [41].

Quantitative Evaluation

Evaluation of 3D motion estimation on three subjects of the HumanEva dataset. Each cell reports the RMS error and standard deviation in millimeters.

         Walking                                          Jogging
         S1           S2           S3           Mean      S1           S2           S3           Mean
[10]     99.6(42.6)   108.3(42.3)  127.4(24.0)  111.8     109.2(41.5)  93.1(41.1)   115.8(40.6)  106.0
[5]      71.9(19.0)   75.7(15.9)   85.3(10.3)   77.6      62.6(10.2)   77.7(12.1)   54.4(9.0)    64.9
Ours     62.2(18.6)   61.9(13.2)   69.2(22.4)   64.4      56.3(15.4)   59.3(14.4)   59.3(15.5)   58.3

Evaluation of 3D motion estimation on the Human3.6M dataset. Errors are reported as the mean per joint position error (MPJPE) [45], in millimeters.

Directions Discussion Eating Greeting Phoning Photo Posing Purchases Sitting
LinKDE[45] 132.71 183.55 132.37 164.39 162.12 205.94 150.61 171.31 151.57
Li et al.[28] - 136.88 96.94 124.74 - 168.68 - - -
Ours 85.07 112.68 104.90 122.05 139.08 135.91 105.93 166.16 117.49

SittingDown Smoking Waiting WalkDog Walking WalkTogether Mean(6 actions) Mean(15 actions)
LinKDE[45] 243.03 162.14 170.69 177.13 96.60 127.88 160.00 162.14
Li et al.[28] - - - 132.17 69.97 - 121.56 -
Ours 226.94 120.02 117.65 137.36 99.26 106.54 118.69 126.47
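The two metrics above are straightforward to compute once predicted and ground-truth poses are aligned; the sketch below assumes the poses are already aligned per the respective protocols ([45] for MPJPE, [46] for HumanEva), which this page does not restate.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error (MPJPE): the Euclidean distance
    between predicted and ground-truth joint positions, averaged over
    all joints and frames. pred, gt: (T, J, 3) arrays in millimeters."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def error_stats(pred, gt):
    """Per-frame mean joint error, summarized as (mean, std) across
    frames -- the style of number shown in the HumanEva table above."""
    per_frame = np.linalg.norm(pred - gt, axis=-1).mean(axis=1)
    return float(per_frame.mean()), float(per_frame.std())
```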


@inproceedings{du2016markerless,
    title={{Marker-less 3D human motion capture with monocular image sequence and height-maps}},
    author={Du, Yu and Wong, Yongkang and Liu, Yonghao and Han, Feilin and Gui, Yilin and Wang, Zhen and Kankanhalli, Mohan and Geng, Weidong},
    booktitle={European Conference on Computer Vision},
    year={2016}
}