Current state-of-the-art paradigms predominantly treat Text-to-Motion (T2M) generation as a direct translation problem, mapping symbolic language directly to continuous poses. While effective for simple actions, this ``System 1'' approach faces a fundamental theoretical bottleneck we identify as the Semantic-Kinematic Impedance Mismatch: the inherent difficulty of grounding semantically dense, discrete linguistic intent into kinematically dense, high-frequency motion data in a single shot. In this paper, we argue that the solution lies in an architectural shift towards Latent System 2 Reasoning. Drawing inspiration from Hierarchical Motor Control in cognitive science, we propose Latent Motion Reasoning (LMR), a cognitive plug-in that reformulates generation as a two-stage ``Think-then-Act'' decision process. Central to LMR is a novel Dual-Granularity Tokenizer that disentangles motion into two distinct manifolds: a compressed, semantically rich Reasoning Latent for planning global topology, and a high-frequency Execution Latent for preserving physical fidelity. By forcing the model to autoregressively ``reason'' (plan the coarse trajectory) before it ``moves'' (instantiates the frames), we effectively bridge the ineffability gap between language and physics. We demonstrate LMR's versatility by implementing it as a plug-in for two representative baselines: T2M-GPT (discrete) and MotionStreamer (continuous). Extensive experiments show that LMR yields non-trivial improvements in both semantic alignment and physical plausibility, validating that the optimal substrate for motion planning is not natural language, but a learned, motion-aligned concept space.
Overview of the proposed Latent Motion Reasoning (LMR) framework. The framework consists of two phases: (Right) Dual-Granularity (DG) Tokenizer: We explicitly disentangle motion representations into two manifolds: a compressed Reasoning Latent (Yellow), which is aligned with text embeddings to capture high-level semantic intent, and a high-frequency Execution Latent (Blue), which preserves low-level kinematic fidelity for reconstruction. (Left) LMR-Generator: We reformulate T2M as a hierarchical "Think-then-Act" generation process. Conditioned on the text prompt, the model first autoregressively synthesizes the coarse-grained reasoning tokens to establish the global motion topology (Thinking Phase). These tokens then serve as a stable semantic condition to guide the subsequent generation of fine-grained execution tokens (Acting Phase) via either Categorical or Diffusive sampling.
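The hierarchical "Think-then-Act" decoding described above can be sketched as a minimal, self-contained Python loop. This is an illustrative stub only: the `think`/`act` functions, codebook sizes (512 reasoning / 1024 execution), and the 1-to-4 token expansion ratio are assumptions for exposition, not the paper's actual tokenizer or transformer backbones.

```python
# Illustrative sketch of LMR's two-phase "Think-then-Act" decoding.
# All model internals are stubbed; names and sizes are hypothetical.

def think(text_prompt, max_reason_tokens=4):
    """Thinking phase: autoregressively emit coarse reasoning tokens
    that plan the global motion topology. Stubbed with a deterministic
    linear-congruential sampler for illustration."""
    tokens = []
    state = sum(map(ord, text_prompt)) & 0xFFFF
    for _ in range(max_reason_tokens):
        state = (state * 1103515245 + 12345) & 0xFFFF
        tokens.append(state % 512)  # reasoning codebook size 512 (assumed)
    return tokens

def act(reason_tokens, frames_per_token=4):
    """Acting phase: expand each coarse reasoning token into fine-grained
    execution tokens (a stand-in for categorical/diffusive sampling)."""
    exec_tokens = []
    for t in reason_tokens:
        for k in range(frames_per_token):
            exec_tokens.append((t * 31 + k) % 1024)  # execution codebook 1024 (assumed)
    return exec_tokens

def generate(text_prompt):
    """Hierarchical generation: plan the coarse trajectory first,
    then instantiate the frame-level tokens conditioned on the plan."""
    plan = think(text_prompt)
    return plan, act(plan)

plan, motion = generate("a person jumps once")
print(len(plan), len(motion))  # -> 4 16
```

The key structural point the sketch captures is that the execution tokens are conditioned on a completed reasoning plan rather than generated directly from text, mirroring the stable-semantic-condition role of the reasoning tokens in the figure.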
Failure: keeps jumping with no wobble.
Failure: keeps jumping with no wobble.
Failure: keeps jumping with no wobble.
Success: jumps once and wobbles after the jump.
Failure: Misses knee raise, no swing out, no repetition.
Failure: Raises the knee but misses the swing, put-down, and repetition.
Failure: Does not raise knee, no swing out, no repetition.
Success: Accurately performs the knee raise, swing, and put-down, and repeats twice.
Failure: Achieves initial pose, but misses raising/lowering both arms together.
Failure: Achieves initial pose and arm lowering, but misses simultaneous arm raise.
Failure: Achieves initial pose, but misses raising/lowering both arms together.
Success: Accurately performs all actions: initial pose and raising/lowering both arms together.
Spatial Direction Understanding: Our LMR framework demonstrates superior ability to distinguish left/right spatial directions, accurately grounding body part references to correct limbs.
✗ Incorrect
✓ Correct
✗ Incorrect
✓ Correct
Complex Long Sentence Understanding: Our LMR framework excels at parsing complex, multi-clause instructions with sequential actions and spatial references.
✗ Fails sequential pattern
✓ Correct left → right → left
Action Repetition Counting: Our LMR framework accurately captures numeric constraints (e.g., "once", "twice").
(Videos set to non-loop to avoid ambiguity. Use navigation buttons below.)
✗ Multiple jumps
✓ Exactly one jump
✗ Failure: Semantic error - stretches the waist left and right instead.
✗ Failure: Does not raise and lower both arms together.
✗ Failure: Semantic error - stretches the waist left and right instead.
✗ Failure: Does not raise and lower both arms together.
✗ Failure: Does not raise and lower both arms together.
✓ Success: Correctly performs all actions.
Details are well restored, close to GT.
Obvious flickering at the beginning; slight swing in the left hand.
Right hand's string-plucking action was not reconstructed.
Right hand's string-plucking action was also not reconstructed.
@misc{qian2025thinkmovelatentmotion,
title={Think Before You Move: Latent Motion Reasoning for Text-to-Motion Generation},
author={Yijie Qian and Juncheng Wang and Yuxiang Feng and Chao Xu and Wang Lu and Yang Liu and Baigui Sun and Yiqiang Chen and Yong Liu and Shujun Wang},
year={2025},
eprint={2512.24100},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.24100},
}