DUET-VLM (CVPR 2026)

> 99 % accuracy retained @ 67 % fewer visual tokens

> 97 % accuracy retained @ 89 % fewer visual tokens

− 31 % training time vs LLaVA-1.5-7B baseline

Abstract

Vision–language models (VLMs) waste compute on hundreds of redundant visual tokens that contribute little to the answer. DUET-VLM is a dual-stage token-reduction framework that jointly optimises the visual and language pathways: (i) a V2V (vision-to-vision) stage merges semantically redundant patch tokens inside the vision encoder using redundancy-aware clustering guided by self-attention, and (ii) a T2V (text-to-vision) stage prunes vision tokens layer by layer inside the LLM, guided by attention from salient text tokens.

Across image and video benchmarks, DUET-VLM matches or beats prior token-reduction methods (VisionZip, PyramidDrop, FastV, FitPrune, HiRED) at every budget — retaining > 99 % of full-token accuracy with a 67 % token cut and > 97 % at 89 %. Crucially, the same machinery also accelerates training, slashing wall-clock training time by ~31 % on LLaVA-1.5-7B with no measurable accuracy loss.

Method

Pipeline. (A) Redundancy-aware V2V merging inside the vision encoder; (B) text-guided T2V pruning across LLM layers.

Stage A · V2V — Vision-to-Vision merging

Rank tokens by V2V self-attention — keep the Top-k₁ dominant tokens.
Cluster the remaining residuals into k₂ groups (centroids = high-score peripheral tokens) and merge each via local neighbourhood averaging.
Applied pre-LLM, before any cross-modal fusion — preserves structural diversity without diluting fine cues.
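
The three Stage A steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `cosine` and `v2v_merge` are hypothetical, `scores` is assumed to be one self-attention score per token, residual tokens are assigned to the most similar centroid by cosine similarity, and a plain cluster mean stands in for the local neighbourhood averaging described above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def v2v_merge(features, scores, k1, k2):
    """Keep the k1 highest-scoring tokens and merge the rest into k2 tokens.

    features: list of N token vectors.
    scores:   list of N attention scores (assumption: one scalar per token).
    Returns k1 dominant tokens (original order) followed by k2 merged tokens.
    """
    n = len(features)
    if len(scores) != n or k1 + k2 > n:
        raise ValueError("need one score per token and k1 + k2 <= N")

    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    dominant = sorted(ranked[:k1])   # top-k1 tokens, restored to input order
    residual = ranked[k1:]           # remaining tokens, highest score first
    centroids = residual[:k2]        # highest-scoring residuals seed the clusters

    clusters = {c: [c] for c in centroids}
    if centroids:
        for i in residual[k2:]:
            # Assumption: similarity in feature space decides cluster membership.
            nearest = max(centroids, key=lambda c: cosine(features[i], features[c]))
            clusters[nearest].append(i)

    merged = []
    for c in centroids:
        members = clusters[c]
        dim = len(features[c])
        # Assumption: a plain mean replaces the local neighbourhood averaging.
        merged.append([sum(features[m][d] for m in members) / len(members)
                       for d in range(dim)])

    return [list(features[i]) for i in dominant] + merged


# Toy example: six 2-D tokens reduced to 2 dominant + 2 merged tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0], [0.9, 0.0], [0.0, 0.9]]
attn_scores = [0.9, 0.8, 0.5, 0.4, 0.1, 0.05]
reduced = v2v_merge(tokens, attn_scores, k1=2, k2=2)
print(len(reduced))  # 4 tokens remain out of 6
```

With `k2 = 0` the sketch degenerates to plain top-k₁ selection, which is one way to see what the merging step adds: low-scoring residuals are folded into a centroid instead of being discarded.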

Stage B · T2V — Text-guided pruning

Salient text tokens 𝒮 (including the last/sink token) attend to vision tokens; their attention at layer ℓ gives each vision token a score A_t2v^(ℓ).
At each pruning layer ℓ, retain the ⌊(1−λ)·N_ℓ⌋ top-ranked of the N_ℓ visual tokens still present and drop the rest; λ increases with depth.
Adaptive and content-aware: keeps query-relevant patches, sheds redundant background.
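
The retention rule ⌊(1−λ)·N_ℓ⌋ can be illustrated with a short sketch. It is a hedged example, not the released code: `t2v_prune` and `tokens_per_stage` are hypothetical names, the per-token score is assumed to be the mean attention from the salient text tokens, and the λ schedule in the usage lines is invented purely for illustration.

```python
import math


def t2v_prune(attn, salient, lam):
    """Return indices of the vision tokens kept at one pruning layer.

    attn[t][v]: attention from text token t to vision token v at this layer.
    salient:    indices of the salient text tokens (the set S).
    lam:        fraction of vision tokens to drop at this layer.
    """
    n_vis = len(attn[0])
    # Assumption: score each vision token by mean attention from salient text tokens.
    scores = [sum(attn[t][v] for t in salient) / len(salient) for v in range(n_vis)]
    # floor((1 - lam) * N); the small epsilon guards against float round-off.
    n_keep = math.floor((1.0 - lam) * n_vis + 1e-9)
    top = sorted(range(n_vis), key=lambda v: scores[v], reverse=True)[:n_keep]
    return sorted(top)  # preserve the original token order


def tokens_per_stage(n0, lambdas):
    """Token count after each pruning stage for a given lambda schedule."""
    counts, n = [], n0
    for lam in lambdas:
        n = math.floor((1.0 - lam) * n + 1e-9)
        counts.append(n)
    return counts


# Two text tokens attending to five vision tokens.
attn = [
    [0.10, 0.40, 0.05, 0.30, 0.15],
    [0.20, 0.10, 0.10, 0.50, 0.10],
]
kept = t2v_prune(attn, salient=[0, 1], lam=0.4)
print(kept)  # [0, 1, 3]

# Hypothetical schedule in which lambda grows with depth.
print(tokens_per_stage(192, [0.25, 0.5, 0.75]))  # [144, 72, 18]
```

Returning the kept indices in their original order keeps positional information usable by later layers; whether the actual method does this is not stated here.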

Results

Figure: radar plot comparing DUET-VLM, VisionZip and PyramidDrop. Accuracy retained vs. full-token baseline on TextVQA, GQA, MME and SQA (LLaVA-1.5-7B, 192-token budget).

Training-time savings at 192 / 128 / 64-token budgets — accuracy almost untouched.

Average accuracy retained (Table 1, condensed)

Method	192 tok	128 tok	64 tok
VisionZip	97.7 %	96.3 %	92.8 %
PyramidDrop	96.4 %	95.6 %	86.7 %
DUET-VLM (C)	99.0 %	98.1 %	95.4 %
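
The token budgets in the table map onto the headline reduction percentages by simple arithmetic. The sketch below assumes the full LLaVA-1.5 input of 576 visual tokens per image; `reduction_pct` is a hypothetical helper name.

```python
FULL_TOKENS = 576  # assumed full visual-token count per image for LLaVA-1.5


def reduction_pct(budget, full=FULL_TOKENS):
    """Percentage of visual tokens removed when only `budget` tokens are kept."""
    return 100.0 * (1.0 - budget / full)


for budget in (192, 128, 64):
    print(f"{budget} tokens kept: {reduction_pct(budget):.1f} % fewer visual tokens")
```

Rounded to whole percentages, the 192-token and 64-token budgets correspond to the 67 % and 89 % reductions quoted at the top of the page.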

Beyond images: video

100.8 % average accuracy @ 53.1 % tokens dropped (Video-LLaVA-7B · TGIF · MSVD · MSRVTT)

97.6 % average accuracy @ 93.4 % tokens dropped (Video-LLaVA-7B, aggressive budget)

Qualitative

Text-guided attention isolates query-relevant patches across LLM layers — DUET-VLM preserves the regions the question actually depends on while shedding background.

BibTeX

@inproceedings{singh2026duetvlm,
  title     = {DUET-VLM: Dual-stage Unified Efficient Token reduction
               for VLM Training and Inference},
  author    = {Singh, Aditya Kumar and Kandala, Hitesh and
               Brahma, Pratik Prabhanjan and Liu, Zicheng and
               Barsoum, Emad},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer
               Vision and Pattern Recognition (CVPR)},
  year      = {2026},
}

DUET-VLM: Dual-stage Unified Efficient Token reduction for VLM Training and Inference

DUET-VLM couples a redundancy-aware vision-to-vision merge with a text-guided text-to-vision prune, squeezing 576 visual tokens down to 192 while retaining > 99 % of baseline accuracy (> 97 % at 64 tokens) and cutting training time by ~31 %.