Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought


Overview of Separate First, Fuse Later (SFFL) reasoning framework.

Abstract

Audio and vision provide complementary evidence for audio-visual question answering, yet current audio-visual large language models may suffer from cross-modal interference: information from one modality misguides the interpretation of another, thereby inducing hallucinations. We attribute this issue to uncontrolled cross-modal interactions during intermediate reasoning. To mitigate it, we propose Separate First, Fuse Later (SFFL), an audio-visual reasoning framework designed to reduce cross-modal interference. SFFL enforces modality-specific chain-of-thought reasoning: it produces separate audio and visual reasoning traces and then integrates their evidence to answer. We construct modality-preference labels with a data pipeline that runs under different modality input settings, and we use these labels to define an auxiliary reward in reinforcement learning that encourages an instance-dependent preference for modality cues when answering. We further introduce a modality-specific reasoning mechanism that preserves modality isolation during the separated reasoning stage while enabling full access to cross-modal information at the evidence fusion stage. Experiments demonstrate consistent improvements in both accuracy and robustness, yielding an average relative gain of 5.16% on general AVQA benchmarks and 11.17% on a cross-modal hallucination benchmark.
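One way an auxiliary modality-preference reward could be combined with an answer-correctness reward is sketched below. This is only an illustration: the weight `MODALITY_WEIGHT`, the function names, and the exact-match scoring are assumptions, not details taken from the paper. The tag names `<mod>` and `<answer>` follow the demo responses shown on this page.

```python
import re
from typing import Optional

# Hypothetical weight of the auxiliary term; the value used in the paper is not stated here.
MODALITY_WEIGHT = 0.5


def extract_tag(response: str, tag: str) -> Optional[str]:
    """Return the stripped text inside the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def combined_reward(response: str, gold_answer: str, gold_modality: str,
                    weight: float = MODALITY_WEIGHT) -> float:
    """Answer reward plus a weighted bonus when the predicted modality matches the label."""
    answer = extract_tag(response, "answer")
    modality = extract_tag(response, "mod")
    # Assumed scoring: case-insensitive exact match for both terms.
    answer_ok = answer is not None and answer.lower() == gold_answer.lower()
    modality_ok = modality is not None and modality.lower() == gold_modality.lower()
    return float(answer_ok) + weight * float(modality_ok)


if __name__ == "__main__":
    sample = "<mod>Audio</mod> <a>Only flowing water.</a> <answer>no</answer>"
    print(combined_reward(sample, gold_answer="no", gold_modality="Audio"))
```

In this sketch a response with a correct answer but a mismatched modality tag still earns the answer reward, so the auxiliary term shapes the modality choice without overriding correctness.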

Main Results

We train on AVQA-PEM-14K and evaluate AVLMs from two angles: cross-modal hallucination and general audio-visual QA. Hallucination is measured on the QA subset of AVHBench (Audio-driven Video Hallucination, Video-driven Audio Hallucination, and Audio-visual Matching), while general QA is evaluated on AVQA, Valor32k-AVQA v2.0, and MUSIC-AVQA. Across two backbones under a unified protocol, our method consistently improves both hallucination robustness and overall QA accuracy, indicating reduced cross-modal interference and better generalization.

Methods                 AVHBench                        | General AVQA
                        VAH↑    AVH↑    MIS↑    Avg.↑   | AVQA↑   Valor2↑ MUSIC↑  Avg.↑
Qwen3-Omni-thinking     77.12   82.31   70.68   75.95   | 90.79   75.59   63.64   76.49
VideoLLaMA2.1-7B-AV     73.93   61.71   51.76   63.47   | 85.69   60.57   79.32   68.65
video-SALMONN-2+(7B)    56.27   84.26   49.68   59.94   | 80.69   61.43   63.41   66.37
gemini-3-flash          72.14   71.65   72.81   72.27   | 89.34   72.67   58.81   73.27
VideoLLaMA2-AVCD        -       -       -       72.15†  | -       -       81.58   -
Backbone: Qwen2.5-Omni-7B
Zero-shot Inference     61.41   70.02   61.51   63.29   | 88.07   66.36   58.82   69.14
SFFL (ours)             62.27   78.61   59.49   64.79   | 88.67   70.59   62.71   71.69
Backbone: Qwen3-Omni-30B-A3B-Instruct
Zero-shot Inference     74.28   81.95   66.36   73.12   | 89.62   76.56   66.00   76.33
PEM-AVQA-14k (GRPO)     75.20   81.69   73.08   75.84   | 91.31   76.35   66.61   77.53
SFFL (ours)             80.79   85.12   79.58   81.29   | 92.31   77.43   69.93   80.24

↑ higher is better. † reported by the corresponding paper.
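As a worked check of what "relative gain" means for these numbers, the sketch below computes (SFFL − zero-shot) / zero-shot from the Avg. columns of the table above. The helper name `relative_gain` and the dictionary layout are illustrative assumptions. For the Qwen3-Omni-30B-A3B-Instruct backbone this gives about 11.17% on AVHBench and about 5.12% on general AVQA; how the abstract aggregates its reported averages is not stated on this page, so the second figure need not match it exactly.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (improved - baseline) / baseline


# (zero-shot Avg., SFFL Avg.) pairs copied from the results table.
averages = {
    "Qwen2.5-Omni-7B": {
        "AVHBench": (63.29, 64.79),
        "General AVQA": (69.14, 71.69),
    },
    "Qwen3-Omni-30B-A3B-Instruct": {
        "AVHBench": (73.12, 81.29),
        "General AVQA": (76.33, 80.24),
    },
}

if __name__ == "__main__":
    for backbone, groups in averages.items():
        for group, (zero_shot, sffl) in groups.items():
            gain = relative_gain(zero_shot, sffl)
            print(f"{backbone} | {group}: {gain:+.2f}%")
```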

Video Demos

Three qualitative examples comparing Qwen3-Omni-thinking vs. our SFFL responses.

Demo 1

Question Are the contexts of audio and visual content matching?

Qwen3-Omni-thinking

<think>Got it, let's check. The video shows a dark, industrial setting with a countdown. The audio has a voice counting from 0 to 9. The visual part includes text saying "The factory isn't safe...", which matches the context of the factory being unsafe. The countdown audio and the visual of the factory environment, so the audio (counting) and visual (factory, warning text) are related. Wait, the question is if the contexts match. Let's see: the audio is a countdown, the visual is a factory with a warning. The countdown might be part of the factory scenario, so they do match. So the answer is 'yes'.</think> <answer>yes</answer>

Ours (SFFL)

<mod>Audio-Visual</mod> <v>The video shows a dark, grainy, and distorted view of what appears to be an industrial setting, possibly a factory. The scene is mostly static with some slow movement, and the overall atmosphere is eerie and unsettling. Towards the end of the video, text appears on the screen that reads, "The factory isn't safe..."</v> <a>The audio contains a person counting from zero to nine, followed by a series of beeps.</a> <sum>The visual content depicts a dark, industrial setting with a warning message about the factory not being safe. The audio consists of a person counting and beeps, which does not directly relate to the visual content. The contexts of the audio and visual content do not match.</sum> <answer>no</answer>

Demo 2

Question Is the wind chime making sound in the audio?

Qwen3-Omni-thinking

<think>Got it, let's check the video. The video shows a wind chime in a bamboo forest. The audio: do we hear the chime? Let's think. The question is if the wind chime is making sound. In the video, the chime is visible, and typically wind chimes make sound when the wind moves them. But wait, the audio might have the chime's sound. Let's recall: the video's audio probably includes the chime's sound. So the answer should be 'yes'.</think> <answer>yes</answer>

Ours (SFFL)

<mod>Audio</mod> <v>The video shows a bamboo forest with a wind chime hanging in the foreground. The wind chime is visible and appears to be made of metal tubes. The camera focuses on the wind chime as it moves slightly, suggesting the presence of wind.</v> <a>The audio contains the sound of flowing water, possibly a stream or river. There is also some indistinct human speech in the background. However, there is no audible sound of a wind chime.</a> <sum>The required modality is Audio. The audio does not contain the sound of a wind chime, only the sound of flowing water and indistinct human speech. Therefore, the answer is based on the audio information alone.</sum> <answer>no</answer>

Demo 3

Question Is the talking man visible in the video?

Qwen3-Omni-thinking

<think> Got it, let's check the video. In the first part, there's a man speaking: "There they go." So the talking man is visible. Let's confirm. Yes, the man is there, so the answer is 'yes'.</think> <answer> yes</answer>

Ours (SFFL)

<mod>Visual</mod> <v>The video shows a street scene with multiple fire trucks moving through traffic. There are pedestrians visible on the sidewalks, and one man in a dark jacket is seen walking on the sidewalk. However, there is no indication that this man is talking, and no close-up or clear view of him speaking is provided.</v> <a>The audio contains the sound of a man speaking.</a> <sum>The visual modality is necessary to answer the question. The audio confirms a man is talking, but the visual information is required to determine if he is visible in the video. The man in the dark jacket is visible, but there is no evidence he is the one talking. </sum> <answer>no</answer>
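The SFFL responses in the demos share a tagged layout: `<mod>` for the selected modality, `<v>` and `<a>` for the separate visual and audio traces, `<sum>` for the fused evidence, and `<answer>` for the final answer. A minimal sketch of how such a response could be split into fields and checked for that tag order is given below. The names `SFFLResponse`, `parse_sffl`, and `is_well_ordered` are hypothetical, and the fixed tag order is inferred from the three demos rather than from a published specification.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Tag order as it appears in the demo responses (an inference, not a formal spec).
TAG_ORDER = ("mod", "v", "a", "sum", "answer")


@dataclass
class SFFLResponse:
    modality: Optional[str]
    visual: Optional[str]
    audio: Optional[str]
    summary: Optional[str]
    answer: Optional[str]


def _tag(text: str, name: str) -> Optional[str]:
    """Return the stripped content of the first <name>...</name> pair, or None."""
    match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_sffl(text: str) -> SFFLResponse:
    """Split a tagged response into its five fields."""
    return SFFLResponse(*(_tag(text, name) for name in TAG_ORDER))


def is_well_ordered(text: str) -> bool:
    """True if every opening tag is present and they appear in TAG_ORDER."""
    positions = [text.find(f"<{name}>") for name in TAG_ORDER]
    return all(p >= 0 for p in positions) and positions == sorted(positions)


if __name__ == "__main__":
    demo = ("<mod>Audio</mod> <v>A wind chime hangs in a bamboo forest.</v> "
            "<a>Flowing water, no chime.</a> <sum>Audio has no chime sound.</sum> "
            "<answer>no</answer>")
    parsed = parse_sffl(demo)
    print(parsed.modality, parsed.answer, is_well_ordered(demo))
```

Keeping the audio and visual traces as separate fields makes it straightforward to inspect which modality a wrong answer drew on.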