UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning

Institute of Artificial Intelligence, Beihang University
National University of Singapore
MMLab, CUHK
Hangzhou International Innovation Institute of Beihang University

*Equal Contribution
Corresponding authors

UAV-Flow consists of a large-scale real-world dataset for language-conditioned UAV imitation learning, featuring multiple UAV platforms, diverse environments, and a wide range of fine-grained flight skill tasks. To enable systematic experimental analysis under the Flow task setting, we additionally provide a simulation-based evaluation protocol and deploy Vision-Language-Action (VLA) models on real UAVs. To the best of our knowledge, this is the first real-world deployment of VLA models for language-guided UAV control in open environments.

Abstract

Unmanned Aerial Vehicles (UAVs) are evolving into language-interactive platforms, enabling more intuitive forms of human-drone interaction. While prior work has focused primarily on high-level planning and long-horizon navigation, we shift attention to language-guided fine-grained trajectory control, where UAVs execute short-range, reactive flight behaviors in response to language instructions. We formalize this problem as the Flying-on-a-Word (Flow) task and introduce UAV imitation learning as an effective approach. In this framework, UAVs learn fine-grained control policies by mimicking expert pilot trajectories paired with atomic language instructions. To support this paradigm, we present UAV-Flow, the first real-world benchmark for language-conditioned, fine-grained UAV control. It includes a task formulation, a large-scale dataset collected in diverse environments, a deployable control framework, and a simulation suite for systematic evaluation. Our design enables UAVs to closely imitate the precise, expert-level flight trajectories of human pilots and supports direct deployment without a sim-to-real gap. We conduct extensive experiments on UAV-Flow, benchmarking Vision-Language Navigation (VLN) and Vision-Language-Action (VLA) paradigms. Results show that VLA models outperform VLN baselines and highlight the critical role of spatial grounding in the fine-grained Flow setting. To the best of our knowledge, this is the first real-world deployment of a VLA system for language-conditioned UAV control in open environments. Data, code, and real-world flight demos are available at https://prince687028.github.io/UAV-Flow.

Real-World Flight Demos

Paper

BibTeX

@misc{wang2025uavflowcolosseorealworldbenchmark,
      title={UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning}, 
      author={Xiangyu Wang and Donglin Yang and Yue Liao and Wenhao Zheng and Wenjun Wu and Bin Dai and Hongsheng Li and Si Liu},
      year={2025},
      eprint={2505.15725},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2505.15725}, 
}