Multimodal Analysis
Analyzing The Language of Visual Tokens • arXiv:2411.05001 • 24 upvotes
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models • arXiv:2411.14982 • 19 upvotes
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration • arXiv:2411.17686 • 19 upvotes
On the Limitations of Vision-Language Models in Understanding Image Transforms • arXiv:2503.09837 • 10 upvotes
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey • arXiv:2503.12605 • 35 upvotes
When Less is Enough: Adaptive Token Reduction for Efficient Image Representation • arXiv:2503.16660 • 72 upvotes
From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration • arXiv:2503.12821 • 10 upvotes
Scaling Laws for Native Multimodal Models • arXiv:2504.07951 • 30 upvotes
Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models • arXiv:2505.14071 • 1 upvote
MLLMs are Deeply Affected by Modality Bias • arXiv:2505.18657 • 5 upvotes
To Trust Or Not To Trust Your Vision-Language Model's Prediction • arXiv:2505.23745 • 4 upvotes
Vision Language Models are Biased • arXiv:2505.23941 • 23 upvotes
Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning • arXiv:2506.04755 • 37 upvotes
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better • arXiv:2506.09040 • 34 upvotes
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks • arXiv:2507.01955 • 36 upvotes
Robust Multimodal Large Language Models Against Modality Conflict • arXiv:2507.07151 • 6 upvotes
Automating Steering for Safe Multimodal Large Language Models • arXiv:2507.13255 • 4 upvotes
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios • arXiv:2507.20198 • 28 upvotes
Adapting Vision-Language Models Without Labels: A Comprehensive Survey • arXiv:2508.05547 • 11 upvotes
Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success • arXiv:2508.04280 • 35 upvotes
Controlling Multimodal LLMs via Reward-guided Decoding • arXiv:2508.11616 • 7 upvotes
IAG: Input-aware Backdoor Attack on VLMs for Visual Grounding • arXiv:2508.09456 • 8 upvotes
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs • arXiv:2508.18264 • 25 upvotes
Visual Representation Alignment for Multimodal Large Language Models • arXiv:2509.07979 • 84 upvotes
Lost in Embeddings: Information Loss in Vision-Language Models • arXiv:2509.11986 • 29 upvotes
LLM-I: LLMs are Naturally Interleaved Multimodal Creators • arXiv:2509.13642 • 9 upvotes
When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs • arXiv:2509.16633 • 2 upvotes
Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation • arXiv:2509.22496 • 4 upvotes
Visual Jigsaw Post-Training Improves MLLMs • arXiv:2509.25190 • 37 upvotes
On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models • arXiv:2510.09008 • 16 upvotes
RL makes MLLMs see better than SFT • arXiv:2510.16333 • 49 upvotes
Revisiting Multimodal Positional Encoding in Vision-Language Models • arXiv:2510.23095 • 22 upvotes
Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization • arXiv:2510.25616 • 105 upvotes
Contamination Detection for VLMs using Multi-Modal Semantic Perturbation • arXiv:2511.03774 • 13 upvotes
Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings • arXiv:2511.05017 • 9 upvotes
10 Open Challenges Steering the Future of Vision-Language-Action Models • arXiv:2511.05936 • 6 upvotes
Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models • arXiv:2511.09809 • 5 upvotes
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens • arXiv:2511.19418 • 29 upvotes
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models • arXiv:2511.17487 • 12 upvotes
Architecture Decoupling Is Not All You Need For Unified Multimodal Model • arXiv:2511.22663 • 29 upvotes
Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs • arXiv:2511.22826 • 8 upvotes
Rethinking Chain-of-Thought Reasoning for Videos • arXiv:2512.09616 • 19 upvotes
Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs • arXiv:2512.08923 • 1 upvote
An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges • arXiv:2512.11362 • 22 upvotes
Masking Teacher and Reinforcing Student for Distilling Vision-Language Models • arXiv:2512.22238 • 29 upvotes
Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models • arXiv:2512.21815 • 22 upvotes
MMFormalizer: Multimodal Autoformalization in the Wild • arXiv:2601.03017 • 105 upvotes
Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification? • arXiv:2601.06993 • 2 upvotes
MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models • arXiv:2601.21181 • 8 upvotes
The Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning • arXiv:2601.14127 • 5 upvotes