DUET-VLM: Dual-stage Unified Efficient Token reduction
for VLM Training and Inference

Advanced Micro Devices, Inc. (AMD-AGI) *Equal contribution
DUET-VLM teaser: training-time accuracy on LLaVA

DUET-VLM couples a redundancy-aware vision-to-vision merge with a text-guided text-to-vision prune. It reduces the 576 visual tokens to as few as 64, retaining > 99 % of baseline accuracy at a 67 % token cut and > 97 % at an 89 % cut, and cuts training time by ~31 %.

  • > 99 % accuracy retained at 67 % fewer visual tokens
  • > 97 % accuracy retained at 89 % fewer visual tokens
  • −31 % training time vs. the LLaVA-1.5-7B baseline

Abstract

Vision–language models (VLMs) waste compute on hundreds of redundant visual tokens that contribute little to the answer. DUET-VLM is a dual-stage token-reduction framework that jointly optimises the visual and language pathways: (i) a V2V stage merges semantically redundant patch tokens inside the vision encoder using a redundancy-aware clustering of self-attention, and (ii) a T2V stage prunes vision tokens inside the LLM layer-by-layer, guided by attention from salient text tokens.

Across image and video benchmarks, DUET-VLM matches or beats prior token-reduction methods (VisionZip, PyramidDrop, FastV, FitPrune, HiRED) at every budget — retaining > 99 % of full-token accuracy with a 67 % token cut and > 97 % at 89 %. Crucially, the same machinery also accelerates training, slashing wall-clock training time by ~31 % on LLaVA-1.5-7B with no measurable accuracy loss.

Method

DUET-VLM pipeline

Pipeline. (A) Redundancy-aware V2V merging inside the vision encoder; (B) text-guided T2V pruning across LLM layers.

Stage A · V2V — Vision-to-Vision merging

  • Rank tokens by V2V self-attention — keep the Top-k1 dominant tokens.
  • Cluster the remaining residuals into k2 groups (centroids = high-score peripheral tokens) and merge each via local neighbourhood averaging.
  • Applied pre-LLM, before any cross-modal fusion — preserves structural diversity without diluting fine cues.
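
The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `cosine` and `v2v_merge` are hypothetical, the per-token `scores` are taken as given (how they are derived from V2V self-attention is not shown), residual tokens are assigned to centroids by cosine similarity, and a plain group mean stands in for the local neighbourhood averaging.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (plain lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def v2v_merge(tokens, scores, k1, k2):
    """Reduce len(tokens) patch tokens to k1 dominant + k2 merged tokens.

    tokens: list of feature vectors, one per patch
    scores: one attention-derived score per patch (assumed precomputed)
    """
    n = len(tokens)
    if len(scores) != n or k1 < 0 or k2 < 0 or k1 + k2 > n:
        raise ValueError("need len(scores) == len(tokens) and 0 <= k1 + k2 <= n")

    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    dominant = sorted(ranked[:k1])   # top-k1 tokens, restored to patch order
    residual = ranked[k1:]
    centroids = residual[:k2]        # highest-scoring residuals seed the groups
    groups = {c: [c] for c in centroids}

    # Assumption: each leftover token joins its most similar centroid.
    if centroids:
        for i in residual[k2:]:
            best = max(centroids, key=lambda c: cosine(tokens[i], tokens[c]))
            groups[best].append(i)

    # Assumption: a plain mean per group stands in for neighbourhood averaging.
    merged = []
    for c in centroids:
        members = [tokens[i] for i in groups[c]]
        merged.append([sum(col) / len(members) for col in zip(*members)])

    return [tokens[i] for i in dominant] + merged


# Toy input: six 2-D "patches" with made-up scores.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0], [0.9, 0.0], [0.0, 0.8]]
attn_scores = [0.9, 0.8, 0.5, 0.4, 0.1, 0.05]
reduced = v2v_merge(patches, attn_scores, k1=2, k2=2)
print(len(reduced))  # 4
```

The output always has k1 + k2 tokens, so the budget handed to the LLM is fixed in advance regardless of image content.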

Stage B · T2V — Text-guided pruning

  • Salient text tokens 𝒮 (including the last/sink token) attend to vision tokens with score A_t2v(ℓ).
  • At every stage layer ℓ, retain ⌊(1−λ)·N⌋ top-ranked visual tokens — drop the rest; λ ramps up as reasoning deepens.
  • Adaptive, content-aware — keeps query-relevant patches, sheds redundant background.
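
A single pruning stage and the resulting token schedule can be sketched as follows. This is an illustration under assumptions rather than the paper's code: `t2v_scores`, `t2v_prune` and `token_schedule` are hypothetical names, the scores of the salient text tokens are aggregated by a simple mean, N is taken to be the number of visual tokens entering the stage, and the λ values in the usage lines are made up.

```python
import math


def t2v_scores(attn_rows):
    """Mean attention each vision token receives from the salient text tokens.

    attn_rows[s][j] = attention from salient text token s to vision token j.
    """
    if not attn_rows:
        raise ValueError("need at least one salient text token")
    return [sum(col) / len(attn_rows) for col in zip(*attn_rows)]


def t2v_prune(vision_ids, attn_rows, lam):
    """Keep the floor((1 - lam) * N) top-ranked vision tokens, in original order."""
    n = len(vision_ids)
    keep = math.floor((1 - lam) * n)
    scores = t2v_scores(attn_rows)
    top = sorted(range(n), key=lambda j: scores[j], reverse=True)[:keep]
    return [vision_ids[j] for j in sorted(top)]


def token_schedule(n, lambdas):
    """Visual-token count after each pruning stage, starting from n tokens."""
    counts = [n]
    for lam in lambdas:
        counts.append(math.floor((1 - lam) * counts[-1]))
    return counts


ids = [10, 11, 12, 13]
rows = [[0.1, 0.5, 0.3, 0.1],   # attention from one salient text token
        [0.2, 0.4, 0.4, 0.0]]   # attention from the last/sink token
print(t2v_prune(ids, rows, lam=0.5))            # [11, 12]
print(token_schedule(64, [0.25, 0.5, 0.75]))    # [64, 48, 24, 6]
```

Because λ is applied to a floating-point product before the floor, values that are not exactly representable (for example 0.9) can round the kept count down by one; an implementation that needs exact budgets may prefer integer arithmetic.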

Results

Radar plot comparing DUET-VLM, VisionZip, PyramidDrop

Accuracy retained vs. full-token baseline on TextVQA, GQA, MME and SQA (LLaVA-1.5-7B, 192-token budget).

Training-time savings bars

Training-time savings at 192 / 128 / 64-token budgets — accuracy almost untouched.

Average accuracy retained (Tab 1, condensed)

Method         192 tok   128 tok   64 tok
VisionZip      97.7 %    96.3 %    92.8 %
PyramidDrop    96.4 %    95.6 %    86.7 %
DUET-VLM (C)   99.0 %    98.1 %    95.4 %
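
The token budgets in this table map onto the percentage cuts quoted in the abstract, assuming the 576-token baseline mentioned in the teaser. The short check below makes that arithmetic explicit; `token_cut_percent` is a hypothetical helper name.

```python
def token_cut_percent(budget, baseline=576):
    """Percentage of visual tokens removed when keeping `budget` of `baseline`."""
    if not 0 <= budget <= baseline:
        raise ValueError("budget must lie between 0 and baseline")
    return 100.0 * (baseline - budget) / baseline


for budget in (192, 128, 64):
    print(f"{budget:>3} tokens -> {token_cut_percent(budget):.1f} % cut")
# 192 tokens -> 66.7 % cut
# 128 tokens -> 77.8 % cut
#  64 tokens -> 88.9 % cut
```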

Beyond images: video

  • 100.8 % average accuracy at 53.1 % tokens dropped (Video-LLaVA-7B; TGIF, MSVD, MSRVTT)
  • 97.6 % average accuracy at 93.4 % tokens dropped (Video-LLaVA-7B, aggressive budget)
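
A figure above 100 % means the token-reduced model scored slightly higher than the full-token baseline on average. The sketch below shows one way such a number can arise, assuming the average is the mean of per-benchmark ratios (the page does not say whether it is that or a ratio of means). The function name and the scores are hypothetical and are not results from the paper.

```python
def accuracy_retained(reduced, baseline):
    """Mean of per-benchmark (reduced / baseline) ratios, in percent."""
    if not baseline or len(reduced) != len(baseline):
        raise ValueError("need two non-empty score lists of equal length")
    ratios = [100.0 * r / b for r, b in zip(reduced, baseline)]
    return sum(ratios) / len(ratios)


# Hypothetical scores on two benchmarks: one improves, one is unchanged.
baseline_scores = [50.0, 80.0]
reduced_scores = [51.0, 80.0]
retained = accuracy_retained(reduced_scores, baseline_scores)
print(f"{retained:.1f} %")  # slightly above 100
```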

Qualitative

Qualitative example 1
Qualitative example 2

Text-guided attention isolates query-relevant patches across LLM layers — DUET-VLM preserves the regions the question actually depends on while shedding background.

BibTeX

@inproceedings{singh2026duetvlm,
  title     = {DUET-VLM: Dual-stage Unified Efficient Token reduction
               for VLM Training and Inference},
  author    = {Singh, Aditya Kumar and Kandala, Hitesh and
               Brahma, Pratik Prabhanjan and Liu, Zicheng and
               Barsoum, Emad},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer
               Vision and Pattern Recognition (CVPR)},
  year      = {2026},
}