MultiModal_Paper
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Paper
• 2410.13861
• Published • 56
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified
Multimodal Understanding and Generation
Paper
• 2411.07975
• Published • 31
Enhancing the Reasoning Ability of Multimodal Large Language Models via
Mixed Preference Optimization
Paper
• 2411.10442
• Published • 87
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper
• 2411.14402
• Published • 47
DINO-X: A Unified Vision Model for Open-World Object Detection and
Understanding
Paper
• 2411.14347
• Published • 16
Large Multi-modal Models Can Interpret Features in Large Multi-modal
Models
Paper
• 2411.14982
• Published • 19
Efficient Long Video Tokenization via Coordinate-based Patch
Reconstruction
Paper
• 2411.14762
• Published • 11
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic
Vision-Language Negatives
Paper
• 2411.02545
• Published • 1
Hymba: A Hybrid-head Architecture for Small Language Models
Paper
• 2411.13676
• Published • 47
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking
with Motion-Aware Memory
Paper
• 2411.11922
• Published • 19
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper
• 2411.17465
• Published • 90
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for
Training-Free Acceleration
Paper
• 2411.17686
• Published • 19
DreamMix: Decoupling Object Attributes for Enhanced Editability in
Customized Image Inpainting
Paper
• 2411.17223
• Published • 7
FINECAPTION: Compositional Image Captioning Focusing on Wherever You
Want at Any Granularity
Paper
• 2411.15411
• Published • 8
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A
Comprehensive Multimodal Dataset Towards General Medical AI
Paper
• 2411.14522
• Published • 38
Knowledge Transfer Across Modalities with Natural Language Supervision
Paper
• 2411.15611
• Published • 16
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Paper
• 2411.18363
• Published • 10
EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State
Space Duality
Paper
• 2411.15241
• Published • 7
Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient
Paper
• 2411.17787
• Published • 12
On Domain-Specific Post-Training for Multimodal Large Language Models
Paper
• 2411.19930
• Published • 30
One Token to Seg Them All: Language Instructed Reasoning Segmentation in
Videos
Paper
• 2409.19603
• Published • 19
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and
Understanding
Paper
• 2406.19389
• Published • 54
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and
Pruning
Paper
• 2412.03248
• Published • 26
CompCap: Improving Multimodal Large Language Models with Composite
Captions
Paper
• 2412.05243
• Published • 20
Florence-VL: Enhancing Vision-Language Models with Generative Vision
Encoder and Depth-Breadth Fusion
Paper
• 2412.04424
• Published • 62
POINTS1.5: Building a Vision-Language Model towards Real World
Applications
Paper
• 2412.08443
• Published • 38
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity
Visual Descriptions
Paper
• 2412.08737
• Published • 54
SynerGen-VL: Towards Synergistic Image Understanding and Generation with
Vision Experts and Token Folding
Paper
• 2412.09604
• Published • 38
Learned Compression for Compressed Learning
Paper
• 2412.09405
• Published • 13
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via
Hierarchical Window Transformer
Paper
• 2412.13871
• Published • 18
AnySat: An Earth Observation Model for Any Resolutions, Scales, and
Modalities
Paper
• 2412.14123
• Published • 11
FastVLM: Efficient Vision Encoding for Vision Language Models
Paper
• 2412.13303
• Published • 75
Exploring Multi-Grained Concept Annotations for Multimodal Large
Language Models
Paper
• 2412.05939
• Published • 15
Grounding Descriptions in Images informs Zero-Shot Visual Recognition
Paper
• 2412.04429
• Published
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in
Multimodal Large Language Models
Paper
• 2501.05767
• Published • 29
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive
Multimodal Understanding and Generation
Paper
• 2502.05178
• Published • 10
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
Paper
• 2502.05173
• Published • 64
Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
Paper
• 2502.03738
• Published • 11
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward
Model
Paper
• 2501.12368
• Published • 45
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models
via Vision-Guided Reinforcement Learning
Paper
• 2503.18013
• Published • 20
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Paper
• 2501.09781
• Published • 27
Where do Large Vision-Language Models Look at when Answering Questions?
Paper
• 2503.13891
• Published • 8
Seedream 3.0 Technical Report
Paper
• 2504.11346
• Published • 70
RL makes MLLMs see better than SFT
Paper
• 2510.16333
• Published • 49