Think Before You Move: Latent Motion Reasoning for Text-to-Motion Generation

Yijie Qian1,* Juncheng Wang2,* Yuxiang Feng1 Chao Xu3 Wang Lu4 Yang Liu3,5
Baigui Sun3 Yiqiang Chen4 Yong Liu1† Shujun Wang2†
1 Zhejiang University 2 The Hong Kong Polytechnic University
3 IROOTECH & Wolf 1069 b Lab, Sany Group 4 Institute of Computing Technology, CAS 5 King's College London
* Equal contribution     † Corresponding author

Abstract

Current state-of-the-art paradigms predominantly treat Text-to-Motion (T2M) generation as a direct translation problem, mapping symbolic language directly to continuous poses. While effective for simple actions, this "System 1" approach faces a fundamental theoretical bottleneck we identify as the Semantic-Kinematic Impedance Mismatch: the inherent difficulty of grounding semantically dense, discrete linguistic intent into kinematically dense, high-frequency motion data in a single shot. In this paper, we argue that the solution lies in an architectural shift towards Latent System 2 Reasoning. Drawing inspiration from Hierarchical Motor Control in cognitive science, we propose Latent Motion Reasoning (LMR), a cognitive plug-in that reformulates generation as a two-stage "Think-then-Act" decision process. Central to LMR is a novel Dual-Granularity Tokenizer that disentangles motion into two distinct manifolds: a compressed, semantically rich Reasoning Latent for planning global topology, and a high-frequency Execution Latent for preserving physical fidelity. By forcing the model to autoregressively "reason" (plan the coarse trajectory) before it "moves" (instantiate the frames), we effectively bridge the ineffability gap between language and physics. We demonstrate LMR's versatility by implementing it as a plug-in for two representative baselines: T2M-GPT (discrete) and MotionStreamer (continuous). Extensive experiments show that LMR yields non-trivial improvements in both semantic alignment and physical plausibility, validating that the optimal substrate for motion planning is not natural language, but a learned, motion-aligned concept space.


Pipeline Overview

Overview of the proposed Latent Motion Reasoning (LMR) framework. The framework consists of two phases: (Right) Dual-Granularity (DG) Tokenizer: We explicitly disentangle motion representations into two manifolds: a compressed Reasoning Latent (Yellow), which is aligned with text embeddings to capture high-level semantic intent, and a high-frequency Execution Latent (Blue), which preserves low-level kinematic fidelity for reconstruction. (Left) LMR-Generator: We reformulate T2M as a hierarchical "Think-then-Act" generation process. Conditioned on the text prompt, the model first autoregressively synthesizes the coarse-grained reasoning tokens to establish the global motion topology (Thinking Phase). These tokens then serve as a stable semantic condition to guide the subsequent generation of fine-grained execution tokens (Acting Phase) via either Categorical or Diffusive sampling.
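The two-phase decoding described above can be sketched as a nested autoregressive loop: first sample the short coarse-grained reasoning sequence, then sample the longer execution sequence conditioned on the frozen plan. This is a minimal, self-contained sketch; the sampler functions, codebook sizes, and sequence lengths are hypothetical placeholders standing in for the trained model heads, not the authors' actual API.

```python
# Illustrative "Think-then-Act" decoding loop. All names and sizes below
# (sample_reasoning_token, sample_execution_token, R_VOCAB, etc.) are
# hypothetical stand-ins for the trained model components.
import random

R_VOCAB, E_VOCAB = 512, 8192   # assumed reasoning / execution codebook sizes
R_LEN, E_LEN = 16, 64          # the reasoning stream is ~4x coarser in time

def sample_reasoning_token(text_emb, prefix):
    # Stand-in for the autoregressive "Thinking" head: plans global topology.
    random.seed(hash((text_emb, tuple(prefix))) % (2**32))
    return random.randrange(R_VOCAB)

def sample_execution_token(text_emb, reasoning, prefix):
    # Stand-in for the "Acting" head: instantiates frames, conditioned on
    # both the text and the completed coarse plan.
    random.seed(hash((text_emb, tuple(reasoning), tuple(prefix))) % (2**32))
    return random.randrange(E_VOCAB)

def generate(text_emb):
    reasoning = []                       # Thinking phase: coarse plan first
    for _ in range(R_LEN):
        reasoning.append(sample_reasoning_token(text_emb, reasoning))
    execution = []                       # Acting phase: guided by the plan
    for _ in range(E_LEN):
        execution.append(sample_execution_token(text_emb, reasoning, execution))
    return reasoning, execution

plan, motion = generate("a man walks forward")
```

The key structural point is that the execution loop never revises the plan: the reasoning tokens act as a stable semantic condition, exactly as the "Categorical or Diffusive sampling" step in the figure consumes them.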

Quantitative results

Table 1. Comparison with diffusion-, BERT-, and GPT-type models for text-conditional motion synthesis on the HumanML3D and KIT-ML test sets. ± indicates a 95% confidence interval. Among GPT-type methods, the best result is in bold and the second best is underlined.

Table 2. Quantitative comparison on the HumanML3D test set under the continuous motion representation setting (MotionStreamer). Real motion serves as an upper-bound reference. Lower FID and MultiModal Dist are better; higher R-Precision and Diversity are better. Among generative methods, bold denotes the best and underline the second best.

Qualitative Results

Discrete Motion Representation Setting

Fig. 5. Qualitative comparison under the discrete motion representation setting. We compare our LMR framework against state-of-the-art T2M generators including MDM, MoMask, ParCo, BAMM, and T2M-GPT.

Continuous Motion Representation Setting

Fig. 6. Qualitative comparison under the continuous motion representation setting. We compare our LMR framework against the MotionStreamer baseline.


Text-to-motion Generation



A man in spinning in circles and then stops.


a man is doing jumping jacks .


a man walks forward and raises both his arms
and then drop his arms.


a person crawling from right to left
and vice versa.


a person does a salsa dance.


the person is moving from side to side.


a person is boxing, throwing various combinations
and demonstrating fighting footwork.


the person does a cartwheel.


a person walks around in a circle.


a person walks a few steps,
then begins to jog or run.


a person jumps sideways to their right several times, then several times to the left.


the person is running back-and-forth in a crescent shape.


Streaming Long-term Motion Generation



Comparison with Text-to-motion Models

Discrete Motion Representation Setting


Text Prompt: a man jumps once and then wobbles a little while moving legs apart .


MDM

Failure: keeps jumping and never wobbles.


MoMask

Failure: keeps jumping and never wobbles.


ParCo

Failure: keeps jumping and never wobbles.


Ours

Success: jumps once and wobbles after the jump.



Text Prompt: a man raises left foot knee high then swings out and puts down repaets this motion twice .



MDM

Failure: Misses knee raise, no swing out, no repetition.



MoMask

Failure: Raises the knee but misses the swing-out, put-down, and repetition.



ParCo

Failure: Does not raise knee, no swing out, no repetition.



Ours

Success: Accurately performs the knee raise, swing-out, and put-down, and repeats the motion twice.



Text Prompt: a person puts one hand on their hip and the other in the air, then raises and lowers both arms together .



MDM

Failure: Achieves initial pose, but misses raising/lowering both arms together.



MoMask

Failure: Achieves initial pose and arm lowering, but misses simultaneous arm raise.



ParCo

Failure: Achieves initial pose, but misses raising/lowering both arms together.



Ours

Success: Accurately performs all actions: the initial pose and raising/lowering both arms together.


Continuous Motion Representation Setting

Spatial Direction Understanding: Our LMR framework demonstrates superior ability to distinguish left/right spatial directions, accurately grounding body-part references to the correct limbs.

Text: a man kicks with his left leg.

MotionStreamer

✗ Incorrect

Ours

✓ Correct

Text: a man side steps to the left.

MotionStreamer

✗ Incorrect

Ours

✓ Correct

Complex Long Sentence Understanding: Our LMR framework excels at parsing complex, multi-clause instructions with sequential actions and spatial references.

Text: man reaches down to the left reaches to the right replaces to the left.

MotionStreamer

✗ Fails sequential pattern

Ours

✓ Correct left → right → left

Action Repetition Counting: Our LMR framework accurately captures numeric constraints (e.g., "once", "twice").
(Videos set to non-loop to avoid ambiguity. Use navigation buttons below.)

Text: a person does a small jump.

MotionStreamer

✗ Multiple jumps

Ours

✓ Exactly one jump



Generation Diversity


Text Prompt: a person walks in a circle.


Ablation Study on Guidance Strategies


Text Prompt: a person puts one hand on their hip and the other in the air, then raises and lowers both arms together.


Baseline (w/o guidance)

✗ Failure: Semantic error - stretches waist left and right instead


+ TMR Embeddings

✗ Failure: Does not raise and lower both arms together


+ Coarse-to-Fine AR

✗ Failure: Semantic error - stretches waist left and right instead


+ CLIP Token (L=77)

✗ Failure: Does not raise and lower both arms together


+ Explicit Language CoT

✗ Failure: Does not raise and lower both arms together


+ Reasoning Tokens (Ours)

✓ Success: Correctly performs all actions


Ablation Study on Reconstruction


Text Prompt: a person briefly strums a guitar.


GT


downsampling rate = 1

Details are well restored, close to GT.


downsampling rate = 2

Obvious flickering at the beginning; slight swing in the left hand.


downsampling rate = 4

Right hand's string-plucking action was not reconstructed.


downsampling rate = 8

Right hand's string-plucking action was also not reconstructed.
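The pattern in this ablation, where coarser downsampling rates progressively lose brief, high-frequency motions such as the strumming hand, has a simple numeric analogue: temporal average-pooling at rate r dilutes a one-frame event by a factor of r. This toy sketch uses synthetic values purely to illustrate the trade-off; the pooling stand-in and frame counts are not taken from the paper's tokenizer.

```python
# A brief "strum" appears as a single-frame spike; coarse pooling rates
# smooth it away, mirroring the lost string-plucking action at rates 4 and 8.
# All values are synthetic and purely illustrative.
def pool(signal, rate):
    """Average-pool non-overlapping windows of length `rate`."""
    return [sum(signal[i:i + rate]) / rate
            for i in range(0, len(signal), rate)]

signal = [0.0] * 8
signal[3] = 1.0                # a single-frame event, like a brief strum

print(pool(signal, 1))         # rate 1 is the identity: spike kept at full height
print(max(pool(signal, 4)))    # spike diluted to 0.25
print(max(pool(signal, 8)))    # spike diluted to 0.125
```

This is why the dual-granularity design keeps a separate high-frequency execution stream: the compressed reasoning stream alone cannot reconstruct events shorter than its downsampling window.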


Ablation Study on Generation


Text Prompt: a person doing a spesific moves with legs and hands while doing boxing.


Ours
downsampling rate = 1
w/ Chain-of-Thought (Table 2, row 4)


downsampling rate = 1
single token (Table 2, row 2)


downsampling rate = 4
single token w/ semantic (Table 2, row 1)


downsampling rate = 2
single token


downsampling rate = 4
single token


downsampling rate = 8
single token

BibTeX

@misc{qian2025thinkmovelatentmotion,
      title={Think Before You Move: Latent Motion Reasoning for Text-to-Motion Generation}, 
      author={Yijie Qian and Juncheng Wang and Yuxiang Feng and Chao Xu and Wang Lu and Yang Liu and Baigui Sun and Yiqiang Chen and Yong Liu and Shujun Wang},
      year={2025},
      eprint={2512.24100},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.24100}, 
}