We compare visual motion generation performance across different methods given the same text prompt. All text prompts are sourced from the HumanML3D test set.
*We are aware that ACMDM can directly generate mesh-level sequences, but the mesh-generation checkpoint is not publicly released, so we implement the joint-level generation ourselves for a fair comparison.
MARDM: Rethinking Diffusion for Text-Driven Human Motion Generation. CVPR'25
ACMDM: Absolute Coordinates Make Motion Generation Easy. arXiv'25
MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space. ICCV'25
A person touches knees to opposing elbows and then does squats.
The elbow-touching motion is barely visible.
Fails to perform the elbow touching motion.
Fails to generate the instructed motion.
Successfully generates the instructed motion.
This person jumps up and down on the right leg.
Jumps with the wrong lead leg (left).
Jumps with the wrong lead leg (left).
Alternates between legs instead of jumping on the right leg only, remaining less faithful to the prompt.
Follows the text most accurately with realistic motion on the right leg.
A person runs forward then turns completely and does a cartwheel.
Does not turn around, and performs multiple cartwheels.
Does not run before turning around, and performs multiple cartwheels.
Fails to generate the instructed motion.
Follows the text best (run, turn, and single cartwheel) while preserving realistic motion.
The person is in a fight stance turns around to the right.
The generated motion is realistic, but it freezes abruptly at the end of the sequence.
Fails to execute the turning action.
Fails to execute the turning action.
Follows the text best.
The person puts the box down and runs.
Executes the actions in the reverse order (runs first, then puts down the box).
Ignores the "putting down" action and only runs.
Fails to generate the instructed motion.
Follows the text instruction best, executing the multi-stage sequence (put down, then run).
We evaluate the effect of our proposed Semantically Aligned Encoder (SAE) in this section. The comparison shows that a semantically aligned latent space enables better adherence to complex textual instructions.
MoLingo (VAE) and MoLingo (SAE) are two variants of our method. Both generate realistic human motions, but MoLingo (SAE) follows the text more faithfully, as shown below. All text prompts are from the HumanML3D test set.
The person puts the box down and runs.
Generates realistic running motion, but ignores the crucial "puts the box down" action.
Successfully executes the full two-stage instruction (put down the box, then run).
A person does a push up and then uses his arms to balance himself back to his feet.
Generates realistic motion but fails to execute the push-up part of the sequence.
Successfully executes the full complex sequence including the push-up and balancing back to feet.
A person does a swimming motion while standing.
Fails to maintain a steady standing stance, constantly shifting weight between legs.
Maintains a steady standing stance while executing the swimming motion.
The person is walking while kicking out legs.
Kicks in place, failing to execute the crucial "walking" component of the instruction.
Successfully executes both actions simultaneously: walking and kicking out the legs.
A person runs to their right, then left, then right again, and finally walks back to their starting position.
Fails to follow the nuanced instructions: it starts by "walking" instead of "running" and ends by "running" instead of "walking."
Successfully follows all instructions and executes the complex, multi-stage directional and pace sequence accurately.
We demonstrate the physical plausibility of the generated motions by deploying them on a Unitree G1 robot in a motion-tracking task. We integrate MotionStreamer and MoLingo with a pre-trained RL tracking controller that follows the PHC strategy. The generated motions are first retargeted to the robot's body shape, and the tracking policy is then deployed directly to assess performance. This comparison uses the 272D representation (for which retargeting is simpler) to enable a direct comparison with MotionStreamer.
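The deployment pipeline above (text-to-motion generation, retargeting to the robot's body shape, then RL-based tracking) can be sketched in Python. This is a minimal illustration, not the actual MoLingo or PHC code: the function names (`generate_motion`, `tracking_policy`) and the uniform height-scaling retarget are hypothetical placeholders; real retargeting to the Unitree G1 also remaps the skeleton topology and joint limits.

```python
# Hypothetical sketch of the generate -> retarget -> track pipeline.
# All names here are illustrative assumptions, not released APIs.

def retarget_scale(joints, src_height, tgt_height):
    """Naive retarget: uniformly rescale 3D joint positions from the
    source body height to the robot's height. A real retargeter would
    also handle skeleton correspondence and joint limits."""
    s = tgt_height / src_height
    return [[coord * s for coord in joint] for joint in joints]

def deploy(prompt, generate_motion, tracking_policy,
           src_height=1.75, g1_height=1.32):
    """Run the full pipeline for one text prompt."""
    motion = generate_motion(prompt)            # text -> joint trajectories
    reference = [retarget_scale(frame, src_height, g1_height)
                 for frame in motion]           # map to robot body shape
    return tracking_policy(reference)           # RL controller tracks it
```

The split into an offline retargeting step followed by a fixed tracking policy mirrors the description above: the motion generator is never fine-tuned for the robot, so tracking quality directly reflects the physical plausibility of the generated motion.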
A person is swinging a tennis racket.
The motion exhibits physical implausibility, forcing the controller to execute noticeable steps to maintain balance.
The tennis motion is stable and remains grounded, demonstrating high physical plausibility and minimal need for controller correction.
A figure dances ballet elegantly.
MotionStreamer fails to follow the text instruction.
Successfully generates and tracks a complex, stable ballet motion.
The person is dribbling a basketball backwards.
MotionStreamer fails to follow the text instruction.
Generates a dynamic and vivid dribbling motion that the robot successfully tracks.
@misc{he25molingo,
title = {MoLingo: Motion-Language Alignment for Text-to-Motion Generation},
author = {He, Yannan and Tiwari, Garvita and Zhang, Xiaohan and Bora, Pankaj and Birdal, Tolga and Lenssen, Jan Eric and Pons-Moll, Gerard},
year = {2025},
archivePrefix = {arXiv},
}