Hand-Object Interaction (HOI) generation plays a critical role in advancing applications across animation and robotics. Current video-based methods are predominantly single-view, which impedes comprehensive 3D geometry perception and often results in geometric distortions or unrealistic motion patterns. While 3D HOI approaches can generate dynamically plausible motions, their dependence on high-quality 3D data captured in controlled laboratory settings severely limits their generalization to real-world scenarios. To overcome these limitations, we introduce SyncMV4D, the first model that jointly generates synchronized multi-view HOI videos and 4D motions by unifying visual priors, motion dynamics, and multi-view geometry. Our framework features two core innovations: (1) a Multi-view Joint Diffusion (MJD) model that co-generates HOI videos and intermediate motions, and (2) a Diffusion Points Aligner (DPA) that refines the coarse intermediate motion into globally aligned 4D metric point tracks. To tightly couple 2D appearance with 4D dynamics, we establish a closed-loop, mutually enhancing cycle: during the diffusion denoising process, the generated video conditions the refinement of the 4D motion, while the aligned 4D point tracks are reprojected to guide the next joint generation step. Experimentally, our method outperforms state-of-the-art alternatives in visual realism, motion plausibility, and multi-view consistency.
Our SyncMV4D consists of two key components: First, the Multi-view Joint Diffusion (MJD) module generates synchronized multi-view color videos, intermediate motion pseudo videos, and metric depth scales (Sec. 3.3). Second, the Diffusion Points Aligner (DPA) module takes the resulting coarse 4D motions as a conditioning signal to reconstruct globally aligned 4D point tracks (Sec. 3.4). Furthermore, since both MJD and DPA are iterative denoisers, the refined 4D point tracks from DPA are fed back to guide MJD in subsequent denoising steps, forming a closed-loop mutual enhancement cycle (Sec. 3.5).
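To make this coupling concrete, below is a minimal sketch of one possible sampling loop. The `mjd`, `dpa`, and `reproject` callables, their signatures, and the latent shapes are hypothetical placeholders chosen for illustration; this mirrors the described cycle but is not the authors' released implementation.

```python
import torch

def closed_loop_sampling(mjd, dpa, reproject, ref_images, text_emb,
                         cameras, num_steps=50, views=4, frames=16):
    """Jointly denoise multi-view videos and 4D motion with mutual feedback."""
    # Start both branches from Gaussian noise (latent shapes are illustrative).
    video = torch.randn(views, frames, 4, 32, 32)   # multi-view video latents
    motion = torch.randn(views, frames, 4, 32, 32)  # motion pseudo-video latents

    guidance = None  # no 4D track guidance available at the first step
    for t in reversed(range(num_steps)):
        # (1) MJD co-denoises multi-view videos, motion pseudo-videos, and
        #     metric depth scales, optionally guided by reprojected tracks.
        video, coarse_motion, depth_scale = mjd(
            video, motion, t, ref_images, text_emb, guidance=guidance)
        # (2) DPA refines the per-view coarse motions into globally aligned
        #     4D metric point tracks, conditioned on the generated video.
        tracks = dpa(coarse_motion, depth_scale, video, cameras, t)
        # (3) Reproject the aligned tracks into each view to guide the next
        #     MJD denoising step, closing the mutual-enhancement loop.
        guidance = reproject(tracks, cameras)
    return video, tracks
```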
1. The first synchronized multi-view HOI generation method, which synthesizes results with high visual quality, motion plausibility, and view consistency from only reference images and text.
2. A Multi-view Joint Diffusion (MJD) framework for video and motion that unifies visual priors, motion dynamics, and multi-view geometry modeling.
3. A Diffusion Points Aligner (DPA) module that refines the per-view misaligned coarse motions from MJD into globally aligned 4D point tracks through closed-loop feedback and is co-optimized with the multi-view joint diffusion; the reprojection step that closes this loop is sketched after this list.
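Since the aligned 4D point tracks are reprojected into each view to guide the next generation step, here is a minimal sketch of that projection, assuming a standard pinhole camera model. The function name and tensor layout are our own illustration, not the released code.

```python
import torch

def reproject_tracks(tracks_world, K, w2c):
    """Project metric 4D point tracks (world space) into each camera view.

    tracks_world: (T, N, 3) 3D points for N tracks over T frames.
    K:            (V, 3, 3) per-view camera intrinsics.
    w2c:          (V, 4, 4) per-view world-to-camera extrinsics.
    Returns:      (V, T, N, 2) pixel coordinates per view.
    """
    T, N, _ = tracks_world.shape
    ones = torch.ones(T, N, 1)
    homog = torch.cat([tracks_world, ones], dim=-1)           # (T, N, 4)
    # World -> camera coordinates for every view, then drop the homogeneous row.
    cam = torch.einsum('vij,tnj->vtni', w2c, homog)[..., :3]  # (V, T, N, 3)
    # Apply intrinsics and perspective-divide by depth.
    pix = torch.einsum('vij,vtnj->vtni', K, cam)              # (V, T, N, 3)
    return pix[..., :2] / pix[..., 2:].clamp(min=1e-6)
```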
@article{dang2025syncmv4d,
  title={SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis},
  author={Dang, Lingwei and Li, Zonghan and Li, Juntong and Zhang, Hongwen and An, Liang and Liu, Yebin and Wu, Qingyao},
  journal={arXiv preprint arXiv:2511.19319},
  year={2025}
}