MultiModal_Paper
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Paper
• 2410.13861
• Published • 56
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified
Multimodal Understanding and Generation
Paper
• 2411.07975
• Published • 31
Enhancing the Reasoning Ability of Multimodal Large Language Models via
Mixed Preference Optimization
Paper
• 2411.10442
• Published • 87
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper
• 2411.14402
• Published • 47
DINO-X: A Unified Vision Model for Open-World Object Detection and
Understanding
Paper
• 2411.14347
• Published • 16
Large Multi-modal Models Can Interpret Features in Large Multi-modal
Models
Paper
• 2411.14982
• Published • 19
Efficient Long Video Tokenization via Coordinate-based Patch
Reconstruction
Paper
• 2411.14762
• Published • 11
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic
Vision-Language Negatives
Paper
• 2411.02545
• Published • 1
Hymba: A Hybrid-head Architecture for Small Language Models
Paper
• 2411.13676
• Published • 47
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking
with Motion-Aware Memory
Paper
• 2411.11922
• Published • 19
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper
• 2411.17465
• Published • 90
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for
Training-Free Acceleration
Paper
• 2411.17686
• Published • 19
DreamMix: Decoupling Object Attributes for Enhanced Editability in
Customized Image Inpainting
Paper
• 2411.17223
• Published • 7
FINECAPTION: Compositional Image Captioning Focusing on Wherever You
Want at Any Granularity
Paper
• 2411.15411
• Published • 8
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A
Comprehensive Multimodal Dataset Towards General Medical AI
Paper
• 2411.14522
• Published • 38
Knowledge Transfer Across Modalities with Natural Language Supervision
Paper
• 2411.15611
• Published • 16
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Paper
• 2411.18363
• Published • 10
EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State
Space Duality
Paper
• 2411.15241
• Published • 7
Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient
Paper
• 2411.17787
• Published • 12
On Domain-Specific Post-Training for Multimodal Large Language Models
Paper
• 2411.19930
• Published • 30
One Token to Seg Them All: Language Instructed Reasoning Segmentation in
Videos
Paper
• 2409.19603
• Published • 19
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and
Understanding
Paper
• 2406.19389
• Published • 54
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and
Pruning
Paper
• 2412.03248
• Published • 26
CompCap: Improving Multimodal Large Language Models with Composite
Captions
Paper
• 2412.05243
• Published • 20
Florence-VL: Enhancing Vision-Language Models with Generative Vision
Encoder and Depth-Breadth Fusion
Paper
• 2412.04424
• Published • 62
POINTS1.5: Building a Vision-Language Model towards Real World
Applications
Paper
• 2412.08443
• Published • 38
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity
Visual Descriptions
Paper
• 2412.08737
• Published • 54
SynerGen-VL: Towards Synergistic Image Understanding and Generation with
Vision Experts and Token Folding
Paper
• 2412.09604
• Published • 38
Learned Compression for Compressed Learning
Paper
• 2412.09405
• Published • 13
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via
Hierarchical Window Transformer
Paper
• 2412.13871
• Published • 18
AnySat: An Earth Observation Model for Any Resolutions, Scales, and
Modalities
Paper
• 2412.14123
• Published • 11
FastVLM: Efficient Vision Encoding for Vision Language Models
Paper
• 2412.13303
• Published • 75
Exploring Multi-Grained Concept Annotations for Multimodal Large
Language Models
Paper
• 2412.05939
• Published • 15
Grounding Descriptions in Images informs Zero-Shot Visual Recognition
Paper
• 2412.04429
• Published
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in
Multimodal Large Language Models
Paper
• 2501.05767
• Published • 29
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive
Multimodal Understanding and Generation
Paper
• 2502.05178
• Published • 10
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
Paper
• 2502.05173
• Published • 64
Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
Paper
• 2502.03738
• Published • 11
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward
Model
Paper
• 2501.12368
• Published • 45
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models
via Vision-Guided Reinforcement Learning
Paper
• 2503.18013
• Published • 20
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Paper
• 2501.09781
• Published • 27
Where do Large Vision-Language Models Look at when Answering Questions?
Paper
• 2503.13891
• Published • 8
Seedream 3.0 Technical Report
Paper
• 2504.11346
• Published • 70
RL makes MLLMs see better than SFT
Paper
• 2510.16333
• Published • 49