MoLingo: Motion-Language Alignment for Text-to-Motion Generation

arXiv 2025

1University of Tübingen, 2Tübingen AI Center, 3Max Planck Institute for Informatics, Saarland Informatics Campus,
4Imperial College London

Result Gallery

Generated motions and their Unitree G1 tracking results.

Abstract

We introduce MoLingo, a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. Recent works perform latent space diffusion, either on the whole latent at once or auto-regressively over multiple latents. In this paper, we study how to make diffusion on continuous motion latents work best. We focus on two questions: (1) how to build a semantically aligned latent space so diffusion becomes more effective, and (2) how to best inject text conditioning so the motion follows the description closely. We propose a semantic-aligned motion encoder trained with frame-level text labels so that latents with similar text meaning stay close, which makes the latent space more diffusion-friendly. We also compare single-token conditioning with a multi-token cross-attention scheme and find that cross-attention gives better motion realism and text–motion alignment. With semantically aligned latents, auto-regressive generation, and cross-attention text conditioning, our model sets a new state of the art in human motion generation on standard metrics and in a user study. We will release our code and models for further research and downstream usage.
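To make the conditioning comparison concrete, below is a minimal PyTorch-style sketch (illustrative only, not the released model code) contrasting single-token conditioning, which collapses the text into one vector added to every motion latent, with the multi-token cross-attention scheme, where each motion latent attends over all text tokens. Module names, dimensions, and the mean-pooling choice are placeholders.

```python
import torch
import torch.nn as nn

class SingleTokenConditioning(nn.Module):
    """Collapse the text into one summary vector and add it to every motion latent."""
    def __init__(self, d_text, d_motion):
        super().__init__()
        self.proj = nn.Linear(d_text, d_motion)

    def forward(self, motion_latents, text_tokens):
        # motion_latents: (B, T, d_motion), text_tokens: (B, L, d_text)
        pooled = text_tokens.mean(dim=1)                      # (B, d_text) single token
        return motion_latents + self.proj(pooled)[:, None, :]

class CrossAttentionConditioning(nn.Module):
    """Let each motion latent attend over all text tokens (multi-token scheme)."""
    def __init__(self, d_text, d_motion, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=d_motion, kdim=d_text, vdim=d_text,
            num_heads=n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_motion)

    def forward(self, motion_latents, text_tokens):
        # Queries are motion latents; keys/values are the per-token text features.
        attended, _ = self.attn(motion_latents, text_tokens, text_tokens)
        return self.norm(motion_latents + attended)
```

In our comparison, the multi-token cross-attention variant gives better motion realism and text–motion alignment than the pooled single-token variant; the sketch only illustrates the architectural difference.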

Method


Left: Semantically aligned autoencoder architecture. The model comprises an encoder–decoder autoencoder for motion sequences and a parallel text-encoding branch that maps frame-level text labels into class tokens. A cosine-similarity loss aligns the motion latents with their corresponding class tokens. Right: Auto-regressive flow-based latent denoising. Our generation model uses a standard transformer decoder to obtain conditioning vectors, which guide an MLP in iteratively refining the latents. During training, the motion latents are randomly masked and replaced with learnable tokens. During inference, we initialize with fully masked latents, iteratively denoise them, and decode the final latents to obtain the generated motion.
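To make the two training signals concrete, here is a minimal, self-contained PyTorch sketch of (a) the cosine-similarity alignment between motion latents and frame-level text class tokens and (b) the random masking of latents with a learnable token used when training the auto-regressive denoiser. Tensor shapes, the masking ratio, and function names are assumptions for illustration, not the trained configuration.

```python
import torch
import torch.nn.functional as F

def semantic_alignment_loss(motion_latents, text_class_tokens):
    """Cosine-similarity loss pulling each motion latent toward the class token
    of its frame-level text label.

    motion_latents:    (B, T, D) latents from the motion encoder
    text_class_tokens: (B, T, D) class tokens from the text-encoding branch
    """
    cos = F.cosine_similarity(motion_latents, text_class_tokens, dim=-1)  # (B, T)
    return (1.0 - cos).mean()

def mask_latents(latents, mask_token, mask_ratio=0.5):
    """Randomly replace a fraction of the latents with a learnable mask token,
    as done during training of the auto-regressive denoiser.

    latents:    (B, T, D) motion latents
    mask_token: (D,) learnable parameter
    """
    B, T, D = latents.shape
    mask = torch.rand(B, T, device=latents.device) < mask_ratio          # (B, T) bool
    masked = torch.where(mask[..., None], mask_token.expand(B, T, D), latents)
    return masked, mask
```

At inference, all positions start as mask tokens and are iteratively denoised before the final latents are decoded into motion.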

Comparison with MARDM, ACMDM, and MotionStreamer

We compare visual motion generation performance across different methods given the same text prompt. All text prompts are sourced from the HumanML3D test set.

Motion Representation Format

  • MARDM uses a 67D motion representation. We generate motions in its native format and then run SMPLify to convert the results into SMPL mesh sequences.
  • ACMDM operates on pure absolute joint positions. Similar to MARDM, we implement joint-level generation and apply SMPLify for visualization.

    *We are aware that ACMDM can directly generate mesh-level sequences, but the mesh-generation checkpoint is not publicly released, so we implement the joint-level generation ourselves for a fair comparison.

  • MotionStreamer proposes a 272D representation. We directly extract the rotation components and perform the SMPL forward pass to obtain meshes (see the sketch after this list).
  • Our Method uses the 263D representation most commonly adopted in the community and likewise applies SMPLify for visualization.
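For the rotation-based route, the SMPL forward pass can be run with the smplx library roughly as follows. This is an illustrative sketch: the slicing of MotionStreamer's 272D vector into rotation and translation components is omitted, since its exact layout is not reproduced here, and a neutral body shape is assumed.

```python
import torch
import smplx  # pip install smplx; requires downloaded SMPL model files

def rotations_to_mesh(global_orient, body_pose, transl, smpl_model_path):
    """Run the SMPL forward pass on per-frame rotations to get mesh vertices.

    global_orient: (T, 3)  root orientation, axis-angle
    body_pose:     (T, 69) 23 body joints, axis-angle
    transl:        (T, 3)  root translation
    """
    T = body_pose.shape[0]
    smpl = smplx.create(smpl_model_path, model_type='smpl', batch_size=T)
    output = smpl(global_orient=global_orient,
                  body_pose=body_pose,
                  transl=transl,
                  betas=torch.zeros(T, 10))  # neutral body shape (assumption)
    return output.vertices                   # (T, 6890, 3) mesh vertices per frame

# For the joint-position-based representations (MARDM, ACMDM, ours), the joints are
# instead fit with SMPLify-style optimization before such a forward pass.
```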

Referenced Methods:

MARDM: Rethinking Diffusion for Text-Driven Human Motion Generation. CVPR'25

ACMDM: Absolute Coordinates Make Motion Generation Easy. arXiv'25

MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space. ICCV'25

A person touches knees to opposing elbows and then does squats.

MARDM

The elbow touching is not obvious.

ACMDM

Fails to perform the elbow touching motion.

MotionStreamer

Fails to follow the text instruction.

MoLingo (Ours)

Successfully generates the instructed motion.

This person jumps up and down on the right leg.

MARDM

Jumps with the wrong lead leg (left).

ACMDM

Jumps with the wrong lead leg (left).

MotionStreamer

Alternates legs and remains less faithful to the prompt.

MoLingo (Ours)

Follows the text most accurately with realistic motion on the right leg.

A person runs forward then turns completely and does a cartwheel.

MARDM

Does not turn around, and performs multiple cartwheels.

ACMDM

Does not run before turning around, and performs multiple cartwheels.

MotionStreamer

Fails to follow the text instruction.

MoLingo (Ours)

Follows the text best (run, turn, and single cartwheel) while preserving realistic motion.

The person is in a fight stance turns around to the right.

MARDM

The generated motion is realistic, but it abruptly freezes upon termination.

ACMDM

Fails to execute the turning action.

MotionStreamer

Fails to execute the turning action.

MoLingo (Ours)

Follows the text best.

The person puts the box down and runs.

MARDM

Executes the actions in the reverse order (runs first, then puts down the box).

ACMDM

Ignores the "putting down" action and only runs.

MotionStreamer

Fails to follow the text instruction.

MoLingo (Ours)

Follows the text instruction best, executing the multi-stage sequence (put down, then run).

Ablation Study

We evaluate the effect of our proposed Semantically Aligned Encoder (SAE) in this section. The comparison shows that a semantically aligned latent space enables better adherence to complex textual instructions.

MoLingo Variants:

MoLingo (VAE) and MoLingo (SAE) are two variants of our method. Both generate realistic human motions, but MoLingo (SAE) follows the text more faithfully, as shown below. All text prompts are from the HumanML3D test set.

The person puts the box down and runs.

MoLingo (VAE)

MoLingo (SAE)

Generates realistic running motion, but ignores the crucial "puts the box down" action.

Successfully executes the full two-stage instruction (puts down the box, then runs).

A person does a push up and then uses his arms to balance himself back to his feet.

MoLingo (VAE)

MoLingo (SAE)

Generates realistic motion but fails to execute the push-up part of the sequence.

Successfully executes the full complex sequence including the push-up and balancing back to feet.

A person does a swimming motion while standing.

MoLingo (VAE)

MoLingo (SAE)

Fails to maintain a steady standing stance, constantly shifting weight between legs.

Maintains a steady standing stance while executing the swimming motion.

The person is walking while kicking out legs.

MoLingo (VAE)

MoLingo (SAE)

Kicks in place, failing to execute the crucial "walking" component of the instruction.

Successfully executes both actions simultaneously: walking and kicking out the legs.

A person runs to their right, then left, then right again, and finally walks back to their starting position.

MoLingo (VAE)

MoLingo (SAE)

Fails to follow the nuanced instructions: it starts by "walking" instead of "running" and ends by "running" instead of "walking."

Successfully follows all instructions and executes the complex, multi-stage directional and pace sequence accurately.

Motion Tracking with Unitree G1 Robot

We demonstrate the physical plausibility of the generated motions by deploying them on a Unitree G1 robot in a motion-tracking task. We integrate MotionStreamer and MoLingo with a pre-trained RL tracking controller that follows the PHC strategy, retarget the generated motions to the robot's body shape, and directly deploy the tracking policy to assess performance. For this comparison we use the 272D representation, where retargeting is simpler, matching MotionStreamer's native format.

A person is swinging a tennis racket.

MotionStreamer

MoLingo (Ours)

The generated motion is physically implausible, forcing the controller to take noticeable corrective steps to maintain balance.

The tennis motion is stable and remains grounded, demonstrating high physical plausibility and minimal need for controller correction.

A figure dances ballet elegantly.

MotionStreamer

MoLingo (Ours)

MotionStreamer fails to follow the text instruction.

Successfully generates and tracks a complex, stable ballet motion.

The person is dribbling a basketball backwards.

MotionStreamer

MoLingo (Ours)

MotionStreamer fails to follow the text instruction.

Generates a dynamic and vivid dribbling motion that the robot successfully tracks.

Acknowledgments



We thank Chuqiao Li and István Sárándi for their helpful discussions and proofreading. We also thank Junyu Zhang for valuable discussions regarding the G1 policy training. The project was made possible by funding from the Carl Zeiss Foundation. This work is supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 409792180 (Emmy Noether Programme, project: Real Virtual Humans) and the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A. JEL is supported by the German Research Foundation (DFG) - 556415750 (Emmy Noether Programme, project: Spatial Modeling and Reasoning). GPM is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 - Project number 390727645. This work was supported by the Engineering and Physical Sciences Research Council [grant number EP/X011364/1]. T. B. was supported by a UKRI Future Leaders Fellowship (MR/Y018818/1) as well as a Royal Society Research Grant (RG/R1/241402).

BibTeX

@misc{he25molingo,
    title = {MoLingo: Motion-Language Alignment for Text-to-Motion Generation},
    author = {He, Yannan and Tiwari, Garvita and Zhang, Xiaohan and Bora, Pankaj and Birdal, Tolga and Lenssen, Jan Eric and Pons-Moll, Gerard},
    year = {2025},
    archivePrefix = {arXiv},
}