An overview of DPoser-X's versatility and performance across multiple pose-related tasks. Built on diffusion models, DPoser-X serves as a robust and adaptable prior for 3D whole-body human pose modeling. Shown are scenarios in (a) pose generation, (b) human mesh recovery, and (c) pose completion. DPoser-X consistently outperforms existing priors such as VPoser and NRDF on 8 benchmarks covering body, hand, and face tasks, with improvements of up to 61%.
Overview of the DPoser-regularized optimization framework. Panel (a) shows the unified framework, which refines an initial pose into a final pose through a sequence of DPoser-regularized steps. Panel (b) details a single optimization step: the task inputs (e.g., 2D keypoints in human mesh recovery) and the current pose are used to compute the measurement loss under the degradation pattern $ \mathcal{A}(\cdot) $ (e.g., camera projection). In parallel, DPoser regularization adds noise to the current pose and applies a one-step denoiser to compute $ L_{\text{DPoser}} $. This regularization pulls the current pose toward the learned distribution of plausible poses.
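The optimization step in panel (b) can be illustrated with a small numerical sketch. This is not the DPoser-X implementation: it assumes a linear degradation $ \mathcal{A} $ given by a matrix `A`, a regularizer of the form `weight * ||x - x0_hat||^2` with the denoised pose held fixed, a noise level that is annealed from high to low, and a toy closed-form Gaussian denoiser standing in for the trained diffusion network. All function names and hyperparameters are hypothetical.

```python
import numpy as np

def gaussian_denoiser(x_t, alpha, sigma, mu, s):
    """Toy one-step denoiser: the exact posterior mean E[x0 | x_t]
    when the prior over clean poses is N(mu, s^2 I). It stands in
    for the learned diffusion denoiser (an assumption of this sketch)."""
    gain = alpha * s**2 / (alpha**2 * s**2 + sigma**2)
    return mu + gain * (x_t - alpha * mu)

def regularized_step(x, y, A, denoise, alpha, sigma, weight, lr, rng):
    """One gradient step on measurement loss + prior regularizer."""
    # Measurement loss ||A x - y||^2 and its gradient.
    residual = A @ x - y
    grad_meas = 2.0 * A.T @ residual
    # Prior term: perturb the current pose, denoise once, and treat
    # the denoised pose as a fixed target (no gradient through it).
    x_t = alpha * x + sigma * rng.standard_normal(x.shape)
    x0_hat = denoise(x_t, alpha, sigma)
    grad_prior = 2.0 * weight * (x - x0_hat)
    return x - lr * (grad_meas + grad_prior)

def optimize(y, A, mu, s, steps=500, weight=0.5, lr=0.1, seed=0):
    """Run the regularized loop from a zero initial pose."""
    rng = np.random.default_rng(seed)
    x = np.zeros(A.shape[1])
    denoise = lambda x_t, a, sg: gaussian_denoiser(x_t, a, sg, mu, s)
    for i in range(steps):
        # Hypothetical schedule: noise level decays linearly,
        # with a variance-preserving signal scale alpha.
        sigma = 0.5 * (1.0 - i / steps) + 0.01
        alpha = np.sqrt(1.0 - sigma**2)
        x = regularized_step(x, y, A, denoise, alpha, sigma, weight, lr, rng)
    return x
```

In this toy setup, if `A` observes only some coordinates of the pose, the measurement term fits those coordinates to `y` while the prior term moves the unobserved coordinates toward the prior mean `mu`, which mirrors how the regularizer fills in what the task inputs do not constrain.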
(a) Fused Whole-body Network: The DPoser-X architecture begins with frozen, pre-trained part-specific networks for the body, hands, and face. A separate fused module is then trained on top of these using whole-body datasets to learn the correlations and interdependencies between different body parts, such as the relationship between hand gestures and body posture during specific actions.
(b) Mixed Training Strategy: To improve generalization and prevent overfitting to limited whole-body data, DPoser-X employs a mixed training strategy. It utilizes a large corpus of part-only datasets (body, hand, face) by treating them as incomplete whole-body data and applying loss only to the available parts. To ensure the model can still predict complete poses, whole-body data is also used, sometimes with parts randomly masked, to train the network to fill in the missing information. This strategy results in the DPoser-X-mixed model, which balances realism with high diversity.
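The masking idea behind the mixed training strategy can be sketched as follows. This is an illustrative reading of the description above, not the released training code: the pose layout, the tiny part dimensions, the drop probability, and the function names are all hypothetical. Part-only samples contribute loss only on the parts they contain, while whole-body samples are supervised on every part and may have some parts hidden from the network input.

```python
import numpy as np

# Hypothetical layout of a flattened whole-body pose vector.
# Real models use far larger per-part dimensions.
PART_SLICES = {
    "body": slice(0, 6),
    "hands": slice(6, 10),
    "face": slice(10, 12),
}
POSE_DIM = 12

def part_mask(available):
    """Binary mask with ones on the coordinates of the given parts."""
    mask = np.zeros(POSE_DIM)
    for name in available:
        mask[PART_SLICES[name]] = 1.0
    return mask

def masked_mse(pred, target, mask):
    """Mean squared error over the masked-in coordinates only."""
    sq = mask * (pred - target) ** 2
    return sq.sum() / max(mask.sum(), 1.0)

def make_training_sample(pose, available, rng, drop_prob=0.3):
    """Return (network_input, input_mask, loss_mask) for one sample.

    Part-only data: the network sees and is supervised on the
    available parts only. Whole-body data: supervised on all parts,
    with each part independently hidden from the input with
    probability drop_prob so the network learns to fill it in.
    """
    loss_mask = part_mask(available)
    input_mask = loss_mask.copy()
    if set(available) == set(PART_SLICES):
        for name in PART_SLICES:
            if rng.random() < drop_prob:
                input_mask[PART_SLICES[name]] = 0.0
    return pose * input_mask, input_mask, loss_mask
```

For a hand-only sample, `make_training_sample(pose, ["hands"], rng)` yields a loss mask that is zero on the body and face coordinates, so errors there do not affect the loss.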
In body pose generation, DPoser generates visually diverse and realistic poses, indicating a well-learned prior distribution. In contrast, VPoser shows limited diversity due to its mean-centric nature, while GMM and Pose-NDF produce less natural poses.
In hand inverse kinematics, DPoser remains stable and precise, recovering plausible hand poses even under challenging input conditions: (a) noisy keypoints, (b) fingertip keypoints only, (c) partial finger keypoints, and (d) sparse keypoints.
In whole-body pose completion, DPoser-X effectively learns the correlation between whole-body parts and demonstrates superior performance in completing missing body parts with natural and plausible results.
@article{lu2025dposerx,
title={DPoser-X: Diffusion Model as Robust 3D Whole-body Human Pose Prior},
author={Lu, Junzhe and Lin, Jing and Dou, Hongkun and Zeng, Ailing and Deng, Yue and Liu, Xian and Cai, Zhongang and Yang, Lei and Zhang, Yulun and Wang, Haoqian and Liu, Ziwei},
journal={arXiv preprint arXiv:2508.00599},
year={2025}
}
The website template was adapted from the HumanTOMATO project.