Adapting Large Language Models via Reading Comprehension
Paper
• 2309.09530
• Published
• 82
An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models
Paper
• 2309.09958
• Published
• 20
Noise-Aware Training of Layout-Aware Language Models
Paper
• 2404.00488
• Published
• 10
Streaming Dense Video Captioning
Paper
• 2404.01297
• Published
• 13
Aurora-M: The First Open Source Multilingual Language Model Red-teamed
according to the U.S. Executive Order
Paper
• 2404.00399
• Published
• 42
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with
Interleaved Visual-Textual Tokens
Paper
• 2404.03413
• Published
• 27
LVLM-Interpret: An Interpretability Tool for Large Vision-Language
Models
Paper
• 2404.03118
• Published
• 25
Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak
Attacks?
Paper
• 2404.03411
• Published
• 10
Mixture-of-Depths: Dynamically allocating compute in transformer-based
language models
Paper
• 2404.02258
• Published
• 107
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale
Prediction
Paper
• 2404.02905
• Published
• 74
InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image
Generation
Paper
• 2404.02733
• Published
• 22
FlowMind: Automatic Workflow Generation with LLMs
Paper
• 2404.13050
• Published
• 34
Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image
Synthesis
Paper
• 2404.13686
• Published
• 29
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension
and Generation
Paper
• 2404.14396
• Published
• 19
LAMBDA: A Large Model Based Data Agent
Paper
• 2407.17535
• Published
• 37
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
Paper
• 2407.17490
• Published
• 31
Very Large-Scale Multi-Agent Simulation in AgentScope
Paper
• 2407.17789
• Published
• 35
Efficient Inference of Vision Instruction-Following Models with Elastic
Cache
Paper
• 2407.18121
• Published
• 17
VILA^2: VILA Augmented VILA
Paper
• 2407.17453
• Published
• 41
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any
Person
Paper
• 2407.16224
• Published
• 29
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language
Models
Paper
• 2407.15841
• Published
• 40
An Object is Worth 64x64 Pixels: Generating 3D Object via Image
Diffusion
Paper
• 2408.03178
• Published
• 40
VidGen-1M: A Large-Scale Dataset for Text-to-video Generation
Paper
• 2408.02629
• Published
• 15
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper
• 2408.01800
• Published
• 92
The MERIT Dataset: Modelling and Efficiently Rendering Interpretable
Transcripts
Paper
• 2409.00447
• Published
• 3
QuickVideo: Real-Time Long Video Understanding with System Algorithm
Co-Design
Paper
• 2505.16175
• Published
• 42