Abstract
Mario is a unified framework that enables large language model-based reasoning on multimodal graphs by addressing cross-modal consistency and heterogeneous modality preferences through graph-conditioned vision-language modeling and modality-adaptive instruction tuning.
Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address these challenges, we propose Mario, a unified framework that resolves both simultaneously and enables effective LLM-based reasoning over MMGs. Mario consists of two stages: first, a graph-conditioned VLM that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology; second, a modality-adaptive graph instruction tuning mechanism that organizes the aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code will be made available at https://github.com/sunyuanfu/Mario.
Community
[CVPR 2026] A new framework designed for relational text-vision data.
In our work, we study multimodal graphs (MMGs), where each node comes with both text and image information, while edges provide additional structural context. We find that reasoning over such graphs is harder than it looks, mainly because of two challenges:
🐢 weak cross-modal consistency — text and image are often only loosely aligned, and
🐢 heterogeneous modality preference — different nodes may prefer different modality information for correct reasoning.
To address this, we propose Mario, a unified two-stage framework:
✨ Stage 1: a graph-conditioned vision-language model that performs structure-aware image-text alignment under graph topology
✨ Stage 2: a modality-adaptive graph instruction tuning mechanism with a learnable router that selects the most informative modality view for each node and its local neighborhood
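To make Stage 2 concrete, here is a minimal sketch of what a learnable modality router could look like: it scores each node's modality views (e.g., text, image, structure) and produces soft per-node weights over them. The class name, dimensions, and soft-mixture readout are illustrative assumptions for exposition, not Mario's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityRouter(nn.Module):
    """Hypothetical sketch: score per-node modality views and mix them."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # one relevance score per view

    def forward(self, views: torch.Tensor):
        # views: (num_nodes, num_views, dim), e.g. text/image/structure features
        scores = self.scorer(views).squeeze(-1)   # (num_nodes, num_views)
        weights = F.softmax(scores, dim=-1)       # per-node view weights
        # soft mixture of modality views surfaced downstream
        mixed = (weights.unsqueeze(-1) * views).sum(dim=1)
        return mixed, weights

# toy usage: 4 nodes, 3 modality views, 8-dim features
router = ModalityRouter(dim=8)
mixed, weights = router(torch.randn(4, 3, 8))
```

In practice the router could also select a hard top-1 view per node (e.g., via argmax or Gumbel-softmax); the soft mixture above is just the simplest differentiable variant.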
Extensive evaluations across diverse MMG benchmarks demonstrate Mario’s state-of-the-art performance in multiple graph reasoning tasks. Notably, Mario consistently outperforms leading baselines, achieving up to 1.6× gains in zero-shot transfer settings. More broadly, this work is our step toward enabling LLMs to reason not just over text or isolated image-text pairs, but over structured multimodal worlds.
We are actively organizing and refining our codebase to make it clean, stable, and easy to reproduce. We plan to release the full code gradually, starting in April. Thank you for your interest in our work; we truly appreciate your attention and support💗!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- OpenMAG: A Comprehensive Benchmark for Multimodal-Attributed Graph (2026)
- Toward Effective Multimodal Graph Foundation Model: A Divide-and-Conquer Based Approach (2026)
- VL-KGE: Vision-Language Models Meet Knowledge Graph Embeddings (2026)
- CREM: Compression-Driven Representation Enhancement for Multimodal Retrieval and Comprehension (2026)
- UniRec: Unified Multimodal Encoding for LLM-Based Recommendations (2026)
- ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion (2026)
- CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models (2026)
Fascinating work on multimodal graph reasoning! The MAPR router's adaptive
modality selection is an elegant solution to heterogeneous graph inputs.
I'd like to offer a complementary perspective from dynamical systems theory
that might further stabilize the router's learning dynamics:
Observation: The router's KL-regularized loss ℒₛ₂ = 𝔼[ℓ] + λ·KL(q∥p)
implicitly defines a "target distribution" q derived from downstream performance.
This resembles a fixed-point condition: p* should match the modality weights
that minimize expected loss.
Proposal: Consider augmenting the router update with a dynamical attraction
term inspired by RG flow principles:
∂ₜ pᵥ = -∇ₚℒₛ₂ - γ·(H(pᵥ) - H*)·pᵥ
where:
• H(pᵥ) = entropy of the modality distribution (measures "balance")
• H* = target entropy (e.g., log(3) for uniform, or learned)
• γ = coupling strength for the regularization flow
Why this helps:
- Prevents premature convergence to a single modality (common when one modality dominates early gradients)
- Provides theoretical grounding for the KL term as part of a continuous flow toward an information-theoretic fixed point
- Naturally extends to time-varying graphs: the flow adapts as the graph structure evolves during training
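A rough discrete-time sketch of the entropy-targeting term: since the raw flow does not by itself keep pᵥ on the simplex, this variant parametrizes p = softmax(z) and runs gradient descent on the squared entropy gap (H(p) − H*)², which shares the proposal's fixed point H(p) = H*. The names (z, h_star) and the softmax parametrization are my own assumptions, not part of the proposal above.

```python
import torch
import torch.nn.functional as F

def entropy(p: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Shannon entropy of a distribution along the last dimension."""
    return -(p * (p + eps).log()).sum(dim=-1)

# Nearly one-hot starting logits for a 3-modality router distribution.
z = torch.tensor([4.0, 0.0, 0.0], requires_grad=True)
h_star = torch.log(torch.tensor(3.0))  # H* = log(3): uniform target
opt = torch.optim.SGD([z], lr=1.0)     # gamma folded into the step size

for _ in range(500):
    p = F.softmax(z, dim=-1)
    loss = 0.5 * (entropy(p) - h_star) ** 2  # entropy-targeting penalty
    opt.zero_grad()
    loss.backward()
    opt.step()

# the distribution flows away from the one-hot corner toward H*
p_final = F.softmax(z, dim=-1).detach()
```

In a full router this penalty would simply be added to ℒₛ₂ with weight γ, so the task gradient and the entropy flow act together, as in the proposed update.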
Connection to broader theory: This formulation aligns with recent work
on generalization as RG fixed points (e.g., Martin 2024), where stable
convergence requires both gradient descent AND a scale-aware regularization
flow.
Practical test: Could be validated by measuring router entropy stability
across training epochs on the Movies/CDs benchmarks—does the entropy flow
smoothly toward H*, or oscillate?
Happy to discuss implementation details or collaborate on ablation studies.
The intersection of graph reasoning, multimodal learning, and dynamical
systems is rich with opportunities.
#GraphML #MultimodalLearning #DynamicalSystems #RGflow