Skip to content
Mobrief AI intelligence
Mobrief
Back to archive

Research · arXiv cs.AI

Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

Multimodal Mixture-of-Experts (Mo E) models have achieved remarkable performance on vision-language tasks.

Apr 09, 2026 17:59 UTC · Paper: ~15 min · Research Source
  • AI identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as…
  • Through systematic analysis, arXiv cs.
  • AI first verify that cross-modal semantic sharing exists in Mo E architectures, ruling out semantic alignment failure as the sole explanation. arXiv cs. AI then reveal that visual experts and…

Context

However, arXiv cs. AI identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, arXiv cs. AI first verify that cross-modal semantic sharing exists in Mo E architectures, ruling out semantic alignment failure as the sole explanation. AI then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, arXiv cs. AI proposes the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. To validate this hypothesis, arXiv cs. AI designs a routing-guided intervention method that enhances domain expert activation. Experiments on three multimodal Mo E models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks. AI's analysis further reveals that domain expert identification locates cognitive…

AI identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as…

Paper PDF