Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought
Abstract
Audio and vision provide complementary evidence for audio-visual question answering, yet current audio-visual large language models can suffer from cross-modal interference: information from one modality misguides the interpretation of the other, thereby inducing hallucinations. We attribute this issue to uncontrolled cross-modal interactions during intermediate reasoning. To mitigate it, we propose Separate First, Fuse Later (SFFL), an audio-visual reasoning framework designed to reduce cross-modal interference. SFFL enforces modality-specific chain-of-thought reasoning: it produces separate audio and visual reasoning traces and then integrates their evidence to answer. We construct modality-preference labels via a data pipeline under different modality input settings, and use these labels as an auxiliary reward in reinforcement learning to encourage an instance-dependent preference for modality cues when answering. We further introduce a modality-specific reasoning mechanism that preserves modality isolation during the separated reasoning stage while enabling full access to cross-modal information at the evidence fusion stage. Experiments demonstrate consistent improvements in both accuracy and robustness, yielding an average relative gain of 5.16% on general AVQA benchmarks and 11.17% on a cross-modal hallucination benchmark.
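The abstract does not say how modality isolation is enforced. One plausible reading, shown here purely as an illustrative sketch, is an attention mask in which audio reasoning tokens cannot see video inputs (and vice versa), while fusion tokens see everything. The segment names, the `ALLOWED` policy, and `build_mask` are all hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical segment ids for a toy token sequence (assumed, not from the paper).
PROMPT, AUDIO_IN, VIDEO_IN, AUDIO_COT, VIDEO_COT, FUSION = range(6)

# Assumed visibility policy: which key segments each query segment may attend to.
ALLOWED = {
    PROMPT: {PROMPT},
    AUDIO_IN: {PROMPT, AUDIO_IN},
    VIDEO_IN: {PROMPT, VIDEO_IN},
    AUDIO_COT: {PROMPT, AUDIO_IN, AUDIO_COT},   # audio trace: no video access
    VIDEO_COT: {PROMPT, VIDEO_IN, VIDEO_COT},   # visual trace: no audio access
    FUSION: {PROMPT, AUDIO_IN, VIDEO_IN, AUDIO_COT, VIDEO_COT, FUSION},
}


def build_mask(segments):
    """Return a boolean [query, key] mask; True means attention is permitted."""
    segments = [int(s) for s in segments]
    n = len(segments)
    # Standard causal constraint: a token only sees itself and earlier tokens.
    causal = np.tril(np.ones((n, n), dtype=bool))
    # Segment-level constraint from the assumed policy above.
    allowed = np.zeros((n, n), dtype=bool)
    for q in range(n):
        for k in range(n):
            allowed[q, k] = segments[k] in ALLOWED[segments[q]]
    return causal & allowed


# One token per segment, in generation order.
toy_segments = [PROMPT, AUDIO_IN, VIDEO_IN, AUDIO_COT, VIDEO_COT, FUSION]
mask = build_mask(toy_segments)
print(mask.astype(int))
```

In this toy layout, the audio reasoning row is blocked at the video-input column and the visual reasoning row is blocked at both the audio-input and audio-reasoning columns, while the fusion row can attend to every earlier token.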
Main Results
We train on AVQA-PEM-14K and evaluate audio-visual LLMs (AVLMs) from two angles: cross-modal hallucination and general audio-visual QA. Hallucination is measured on the QA subset of AVHBench (Audio-driven Video Hallucination, Video-driven Audio Hallucination, and Audio-visual Matching), while general QA is evaluated on AVQA, Valor32k-AVQA v2.0, and MUSIC-AVQA. Across two backbones under a unified protocol, our method consistently improves both hallucination robustness and overall QA accuracy, indicating reduced cross-modal interference and better generalization.
| Methods | AVHBench VAH↑ | AVHBench AVH↑ | AVHBench MIS↑ | AVHBench Avg.↑ | General AVQA↑ | General Valor2↑ | General MUSIC↑ | General Avg.↑ |
|---|---|---|---|---|---|---|---|---|
| Qwen3-Omni-thinking | 77.12 | 82.31 | 70.68 | 75.95 | 90.79 | 75.59 | 63.64 | 76.49 |
| VideoLLaMA2.1-7B-AV | 73.93 | 61.71 | 51.76 | 63.47 | 85.69 | 60.57 | 79.32 | 68.65 |
| video-SALMONN-2+(7B) | 56.27 | 84.26 | 49.68 | 59.94 | 80.69 | 61.43 | 63.41 | 66.37 |
| gemini-3-flash | 72.14 | 71.65 | 72.81 | 72.27 | 89.34 | 72.67 | 58.81 | 73.27 |
| VideoLLaMA2-AVCD | - | - | - | 72.15† | - | - | 81.58† | - |
| Backbone: Qwen2.5-Omni-7B | | | | | | | | |
| Zero-shot Inference | 61.41 | 70.02 | 61.51 | 63.29 | 88.07 | 66.36 | 58.82 | 69.14 |
| SFFL (ours) | 62.27 | 78.61 | 59.49 | 64.79 | 88.67 | 70.59 | 62.71 | 71.69 |
| Backbone: Qwen3-Omni-30B-A3B-Instruct | | | | | | | | |
| Zero-shot Inference | 74.28 | 81.95 | 66.36 | 73.12 | 89.62 | 76.56 | 66.00 | 76.33 |
| PEM-AVQA-14k (GRPO) | 75.20 | 81.69 | 73.08 | 75.84 | 91.31 | 76.35 | 66.61 | 77.53 |
| SFFL (ours) | 80.79 | 85.12 | 79.58 | 81.29 | 92.31 | 77.43 | 69.93 | 80.24 |
↑ higher is better. † reported by the corresponding paper.
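To make the size of the improvements easier to read off the table, the sketch below computes relative gains of SFFL over zero-shot inference from the two Avg. columns. It assumes relative gain means (SFFL − zero-shot) / zero-shot; the exact averaging behind the figures quoted in the abstract is not stated here, so this calculation is not guaranteed to reproduce them. The helper name `relative_gain` is ours.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# (zero-shot Avg., SFFL Avg.) pairs copied from the table above.
results = {
    "Qwen2.5-Omni-7B": {
        "AVHBench": (63.29, 64.79),
        "General AVQA": (69.14, 71.69),
    },
    "Qwen3-Omni-30B-A3B-Instruct": {
        "AVHBench": (73.12, 81.29),
        "General AVQA": (76.33, 80.24),
    },
}

gains = {
    backbone: {name: relative_gain(*pair) for name, pair in groups.items()}
    for backbone, groups in results.items()
}

for backbone, groups in gains.items():
    for name, gain in groups.items():
        print(f"{backbone} | {name}: +{gain:.2f}%")
```

Under this assumption, the Qwen3-Omni-30B-A3B-Instruct backbone gains about 11.17% on the AVHBench average and about 5.12% on the general AVQA average; the Qwen2.5-Omni-7B gains are smaller on both.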
Video Demos
Three qualitative examples comparing Qwen3-Omni-thinking vs. our SFFL responses.