SEATS

Stage-adaptive Token Selection for Efficient Omni-modal LLMs

Zijie Xin¹, Jie Yang²,*, Ruixiang Zhao¹, Tianyi Wang², Fengyun Rao², Jing LYU², Xirong Li¹,*
¹Renmin University of China   ²WeChat Vision, Tencent Inc.
*Corresponding author

Efficiency-performance trade-off of training-free token selection methods for omni-modal LLMs. SEATS achieves higher performance with lower token-selection and prefill latency.

TL;DR

We propose SEATS, a training-free, stage-adaptive token selection method for efficient omni-modal LLM inference. By analyzing layer-wise token dependency, we find that visual and audio dependencies follow a block-wise pattern and weaken with depth. SEATS removes spatiotemporal redundancy before the LLM, progressively prunes tokens inside the LLM, and fully removes non-textual tokens in late layers. With only 10% of visual and audio tokens retained, SEATS achieves a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.

Key numbers: 9.3x FLOPs reduction | 4.8x prefill speedup | 96.3% performance preserved | 10% token retention ratio

Abstract

Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Although training-free token selection can reduce this cost, existing methods either focus on visual-only inputs or prune om-LLM tokens only before the LLM with fixed per-modality ratios, failing to capture how cross-modal token importance evolves across layers.

To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many late-layer non-textual tokens become redundant after cross-modal fusion. Motivated by this observation, we propose SEATS, a training-free, stage-adaptive token selection method for efficient om-LLM inference. Before the LLM, SEATS removes spatiotemporal redundancy via attention-weighted diversity selection. Inside the LLM, it progressively prunes tokens across blocks and dynamically allocates the retention budget from temporal windows to modalities using query relevance scores. In late layers, it removes all remaining non-textual tokens once cross-modal fusion is complete.
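As a rough illustration of the pre-LLM step described above, the sketch below greedily selects tokens inside one temporal window so that the kept tokens are both highly attended and mutually dissimilar. It is not the authors' implementation: the function names (`cosine`, `select_diverse`), the greedy procedure, and the scoring rule (attention weight times cosine distance to the nearest already-selected token) are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def select_diverse(features, attn, keep):
    """Greedily pick `keep` token indices from one temporal window.

    Assumed scoring rule (hypothetical, not taken from the paper): a candidate's
    gain is its attention weight times its cosine distance to the closest token
    that has already been selected.
    """
    n = len(features)
    keep = min(keep, n)
    if keep <= 0:
        return []
    # Seed the selection with the most-attended token.
    selected = [max(range(n), key=lambda i: attn[i])]
    # min_dist[i] is the distance from token i to its nearest selected token.
    min_dist = [1.0 - cosine(features[i], features[selected[0]]) for i in range(n)]
    while len(selected) < keep:
        candidates = [i for i in range(n) if i not in selected]
        best = max(candidates, key=lambda i: attn[i] * min_dist[i])
        selected.append(best)
        for i in range(n):
            min_dist[i] = min(min_dist[i], 1.0 - cosine(features[i], features[best]))
    return sorted(selected)  # restore the original token order


# Toy window: tokens 0 and 1 are near-duplicates, token 2 is distinct.
feats = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.7, 0.7]]
attn = [0.40, 0.35, 0.15, 0.10]
print(select_diverse(feats, attn, keep=2))  # [0, 2]
```

In this toy window, ranking by attention alone would keep the two near-duplicate tokens 0 and 1; weighting by diversity keeps token 2 instead, which is the kind of redundancy removal the paragraph above refers to.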

Experiments on Qwen2.5-Omni and Qwen3-Omni demonstrate that SEATS effectively improves inference efficiency. With only 10% of visual and audio tokens retained, SEATS achieves a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.

Observations: Layer-wise Token Dependency

We examine the effect of removing all visual and/or audio tokens at a specific LLM layer of an om-LLM. A consistent block-wise dependence pattern emerges: shallow layers depend critically on non-textual tokens (removal causes performance to collapse), middle layers show moderate dependence (fusion is underway), and late layers show no dependence (removal has no impact, indicating that fusion is complete).

Layer-wise token dependency analysis
The impact of full visual / audio token removal on the performance of an om-LLM. Depending on the impact, we roughly divide the LLM layers into three blocks: shallow, middle, and late.
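The mechanics of this probe can be sketched on a toy model: run the layers in order and, just before a chosen layer, drop every token of the selected modalities. The code below is only an illustration. `mix_layer` is a made-up stand-in for attention, and `forward_with_removal` and `text_gap` are hypothetical names. The actual analysis measures benchmark performance on a real om-LLM, not a state gap on scalars.

```python
def mix_layer(hidden):
    """Toy stand-in for attention: each token moves halfway toward the sequence mean."""
    mean = sum(hidden) / len(hidden)
    return [0.5 * h + 0.5 * mean for h in hidden]


def forward_with_removal(layers, hidden, modality, remove_at, remove=("visual", "audio")):
    """Run `layers` in order; just before layer `remove_at`, drop tokens of the given modalities.

    layers    : list of callables mapping a list of token states to a list of the same length
    hidden    : list of token states
    modality  : one label per token, e.g. "text", "visual", or "audio"
    remove_at : layer index at which removal happens, or None for no removal
    """
    for idx, layer in enumerate(layers):
        if idx == remove_at:
            kept = [i for i, m in enumerate(modality) if m not in remove]
            hidden = [hidden[i] for i in kept]
            modality = [modality[i] for i in kept]
        hidden = layer(hidden)
    return hidden, modality


def text_gap(remove_at, num_layers=3):
    """Distance between the final text-token state with and without removal."""
    layers = [mix_layer] * num_layers
    hidden = [0.0, 1.0, 1.0]
    modality = ["text", "visual", "audio"]
    full, _ = forward_with_removal(layers, hidden, modality, remove_at=None)
    pruned, _ = forward_with_removal(layers, hidden, modality, remove_at=remove_at)
    return abs(full[0] - pruned[0])  # the text token stays at index 0


# The gap shrinks as the removal point moves deeper into the toy model.
gaps = [text_gap(k) for k in range(3)]
print(gaps)
```

In this toy setting the text token has absorbed more of the non-textual signal by the time later layers run, so removing the other tokens there changes its final state less. That mirrors the reasoning behind the shallow/middle/late split, without reproducing the reported measurements.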

Method

SEATS is a three-stage method:

  1. Stage I: Pre-LLM Token Selection — Applies attention-weighted diversity selection (winDivPrune) within each temporal window to remove spatiotemporal redundancy and shorten the input sequence.
  2. Stage II: Inner-LLM Token Selection — Adopts a block-wise token-retention-ratio (TRR) decay schedule, progressively increasing pruning strength. It distributes the retention budget top-down: first across temporal windows, then across modalities, guided by query relevance scores.
  3. Stage III: Late Removal — Removes all remaining non-textual tokens in late LLM layers where cross-modal fusion is largely completed.
SEATS method overview
Proposed StagE-Adaptive Token Selection (SEATS) method for om-LLMs.
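Stages II and III can be sketched as a per-layer retention schedule plus a top-down budget split. The code below is a minimal sketch under stated assumptions, not the authors' implementation. The layer indices and ratios in the example schedule are invented, the TRR is assumed to be relative to the non-textual tokens entering the LLM, and the proportional split in `split_budget` is one plausible way to turn query relevance scores into per-window and per-modality budgets. All function names are hypothetical.

```python
def trr_at_layer(layer, schedule, late_start):
    """Token retention ratio (TRR) in effect at `layer`.

    schedule   : list of (first_layer, trr) pairs sorted by first_layer, with TRR
                 decaying from block to block (Stage II)
    late_start : first late layer; from here on all non-textual tokens are removed (Stage III)
    """
    if layer >= late_start:
        return 0.0
    trr = 1.0  # before the first scheduled block, nothing is pruned inside the LLM
    for first_layer, block_trr in schedule:
        if layer >= first_layer:
            trr = block_trr
    return trr


def split_budget(total, weights, caps):
    """Split an integer budget proportionally to `weights`, never exceeding `caps`."""
    weights = list(weights)
    if sum(weights) <= 0:
        weights = [1.0] * len(weights)  # no usable scores: fall back to a uniform split
    weight_sum = sum(weights)
    shares = [min(cap, int(total * w / weight_sum)) for w, cap in zip(weights, caps)]
    # Hand out tokens lost to rounding or capping, most relevant entries first.
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    left = min(total, sum(caps)) - sum(shares)
    while left > 0:
        for i in order:
            if left > 0 and shares[i] < caps[i]:
                shares[i] += 1
                left -= 1
    return shares


def allocate(relevance, counts, trr):
    """Top-down allocation: first across temporal windows, then across modalities.

    relevance[w][m] : query relevance score of modality m in window w
    counts[w][m]    : number of tokens of modality m in window w
    Returns a list of dicts with the number of tokens to keep per window and modality.
    """
    total = round(trr * sum(sum(c.values()) for c in counts))
    window_budget = split_budget(
        total,
        [sum(r.values()) for r in relevance],
        [sum(c.values()) for c in counts],
    )
    plan = []
    for r, c, budget in zip(relevance, counts, window_budget):
        mods = list(c)
        shares = split_budget(budget, [r[m] for m in mods], [c[m] for m in mods])
        plan.append(dict(zip(mods, shares)))
    return plan


# Hypothetical schedule: two pruning blocks, then late removal from layer 20.
schedule = [(4, 0.5), (12, 0.25)]
counts = [{"visual": 8, "audio": 2}, {"visual": 8, "audio": 2}]
relevance = [{"visual": 6.0, "audio": 2.0}, {"visual": 0.5, "audio": 1.5}]
print(allocate(relevance, counts, trr_at_layer(4, schedule, late_start=20)))
# [{'visual': 6, 'audio': 2}, {'visual': 0, 'audio': 2}]
```

In this example the first window is more relevant to the query and receives most of the budget, while the second window keeps only audio tokens because its audio relevance dominates. A fixed per-modality ratio would instead keep the same visual/audio proportion in both windows.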

Experimental Results

Extensive experiments on five audio-visual benchmarks with Qwen2.5-Omni-7B and Qwen3-Omni-30B verify the effectiveness of our method.

Qwen2.5-Omni-7B results (figure)
Qwen3-Omni-30B results (figure)

Citation

@article{xin2025seats,
  title={Stage-adaptive Token Selection for Efficient Omni-modal LLMs},
  author={Xin, Zijie and Yang, Jie and Zhao, Ruixiang and Wang, Tianyi and Rao, Fengyun and LYU, Jing and Li, Xirong},
  journal={arXiv preprint arXiv:2605.20035},
  year={2025}
}

Acknowledgements

This research was supported by NSFC (No. 62576348), BJNSF (No. L254039), the Tencent WeChat Rhino-Bird Focused Research Program, and the Outstanding Innovative Talents Cultivation Funded Programs 2025 of Renmin University of China.