Hand-Object Interaction (HOI) generation has significant application potential. However, current 3D HOI motion generation approaches heavily rely on predefined 3D object models and lab-captured motion data, limiting generalization capabilities. Meanwhile, HOI video generation methods prioritize pixel-level visual fidelity, often sacrificing physical plausibility. Recognizing that visual appearance and motion patterns share fundamental physical laws in the real world, we propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate HOI video and motion simultaneously. To integrate the heterogeneous semantics, appearance, and motion features, our method implements tri-modal adaptive modulation for feature alignment, coupled with 3D full-attention for modeling inter- and intra-modal dependencies. Furthermore, we introduce a vision-aware 3D interaction diffusion model that generates explicit 3D interaction sequences directly from the synchronized diffusion outputs, then feeds them back to establish a closed-loop feedback cycle. This architecture eliminates dependencies on predefined object models or explicit pose guidance while significantly enhancing video-motion consistency. Experimental results demonstrate our method's superiority over state-of-the-art approaches in generating high-fidelity, dynamically plausible HOI sequences, with notable generalization capabilities in unseen real-world scenarios. Project page at https://Droliven.github.io/SViMo_project.
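To make the tri-modal design concrete, below is a minimal PyTorch sketch of an AdaLN-style modulation block followed by full attention over the concatenated token streams. The module names, the single shared conditioning vector, and all shapes are our own assumptions for illustration; this is not the paper's released implementation.

```python
import torch
import torch.nn as nn

class TriModalAdaptiveModulation(nn.Module):
    """Hypothetical sketch: per-modality AdaLN-style scale/shift derived from
    a shared conditioning vector (e.g., timestep + text embedding), aligning
    semantics, video, and motion tokens before joint attention."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # One (scale, shift) projection per modality: text, video, motion.
        self.to_mod = nn.ModuleList(
            [nn.Linear(cond_dim, 2 * dim) for _ in range(3)]
        )

    def forward(self, text, video, motion, cond):
        # cond: (B, cond_dim) shared conditioning vector
        out = []
        for tokens, proj in zip((text, video, motion), self.to_mod):
            scale, shift = proj(cond).chunk(2, dim=-1)  # (B, dim) each
            out.append(self.norm(tokens) * (1 + scale[:, None]) + shift[:, None])
        return out  # modulated [text, video, motion] token sequences


class JointFullAttentionBlock(nn.Module):
    """Concatenate the three token streams and apply full self-attention,
    so inter- and intra-modal dependencies are modeled in a single pass."""

    def __init__(self, dim: int, cond_dim: int, heads: int = 8):
        super().__init__()
        self.modulate = TriModalAdaptiveModulation(dim, cond_dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, video, motion, cond):
        text, video, motion = self.modulate(text, video, motion, cond)
        lens = [text.shape[1], video.shape[1], motion.shape[1]]
        x = torch.cat((text, video, motion), dim=1)        # (B, L_total, dim)
        x = x + self.attn(x, x, x, need_weights=False)[0]  # residual attention
        return torch.split(x, lens, dim=1)                 # back to three streams
```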
Our method comprises: (1) a synchronized diffusion model that jointly generates HOI videos and motions (Sec. 3.3), and (2) a vision-aware interaction diffusion model that generates 3D hand pose trajectories and object point clouds from the synchronized model's outputs (Sec. 3.4), then feeds them back into the synchronized denoising process to establish closed-loop optimization (Sec. 3.5).
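The closed loop can be summarized in Python-style pseudocode. Everything below (function names, latent shapes, the flow-matching-style Euler update) is a hedged sketch of the cycle described above, not the actual sampler; `sync_model` and `interaction_model` are stand-ins for trained networks.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def closed_loop_sampling(sync_model: nn.Module,
                         interaction_model: nn.Module,
                         text_cond: torch.Tensor,
                         num_steps: int = 50):
    """Illustrative closed-loop sampler (all names and shapes are assumptions).
    The synchronized model denoises video and motion latents jointly; the
    vision-aware interaction model lifts the intermediate result into explicit
    3D hand trajectories and object point clouds, which are fed back as an
    extra condition at the next step."""
    video = torch.randn(1, 16, 4, 32, 32)  # (B, T, C, H, W) video latents
    motion = torch.randn(1, 16, 256)       # (B, T, D) motion latents
    interaction = None                     # explicit 3D sequence, absent at first
    dt = 1.0 / num_steps
    for t in torch.linspace(1.0, dt, num_steps):
        # One synchronized denoising step over both modalities, optionally
        # conditioned on the previous 3D interaction sequence.
        v_video, v_motion = sync_model(video, motion, text_cond, t, interaction)
        video = video - dt * v_video       # Euler update (an assumed scheme)
        motion = motion - dt * v_motion
        # Vision-aware interaction diffusion: intermediate video/motion ->
        # hand pose trajectories + object point cloud, closing the loop.
        interaction = interaction_model(video, motion, t)
    return video, motion, interaction
```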
1. A novel synchronized diffusion model for joint HOI video and motion denoising, effectively integrating large-scale visual priors with dynamic motion constraints.
2. A vision-aware 3D interaction diffusion model that generates explicit 3D interaction sequences, forming a closed-loop optimization pipeline and enhancing video-motion consistency (a minimal sketch of this interaction sequence follows the list).
3. Our method generates HOI video and motion synchronously without requiring predefined poses or object models. Experimental results demonstrate superior visual quality, motion plausibility, and generalization to unseen real-world data.
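As referenced in contribution 2, the explicit 3D interaction sequence pairs hand pose trajectories with per-frame object point clouds. A minimal container might look like the following; the field names and shapes are hypothetical, chosen only to illustrate what is fed back into the denoising process.

```python
from dataclasses import dataclass
import torch

@dataclass
class InteractionSequence:
    """Hypothetical container for the explicit 3D interaction sequence
    produced by the vision-aware interaction diffusion (field names and
    shapes are our assumptions, not the paper's exact representation)."""
    hand_joints: torch.Tensor  # (T, J, 3) per-frame 3D hand joint positions
    obj_points: torch.Tensor   # (T, N, 3) per-frame object point cloud

    def as_condition(self) -> torch.Tensor:
        # Flatten to (T, J*3 + N*3) tokens so the sequence can be fed back
        # into the synchronized denoising process as guidance.
        T = self.hand_joints.shape[0]
        return torch.cat((self.hand_joints.reshape(T, -1),
                          self.obj_points.reshape(T, -1)), dim=-1)
```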
Here we show some results from the paper. More vivid, animated results are provided in the video above.
For video generation, our method benefits from the synchronized modeling of visual appearance and motion dynamics, yielding a higher Overall score. For motion generation, our approach not only preserves the effectiveness of the input conditions through the tri-modal adaptive modulation mechanism, but also improves object point cloud consistency by leveraging low-level visual priors.
@misc{dang2025svimosynchronizeddiffusionvideo,
      title={SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios},
      author={Lingwei Dang and Ruizhi Shao and Hongwen Zhang and Wei Min and Yebin Liu and Qingyao Wu},
      year={2025},
      eprint={2506.02444},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.02444},
}