kaizuberbuehler's Collections: Benchmarks
GAIA: a benchmark for General AI Assistants
Paper
• 2311.12983
• Published
• 246
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning
Benchmark for Expert AGI
Paper
• 2311.16502
• Published
• 38
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper
• 2404.12390
• Published
• 26
RULER: What's the Real Context Size of Your Long-Context Language
Models?
Paper
• 2404.06654
• Published
• 39
CantTalkAboutThis: Aligning Language Models to Stay on Topic in
Dialogues
Paper
• 2404.03820
• Published
• 25
CodeEditorBench: Evaluating Code Editing Capability of Large Language
Models
Paper
• 2404.03543
• Published
• 18
Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and
Human Ratings
Paper
• 2404.16820
• Published
• 17
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with
Text-Rich Visual Comprehension
Paper
• 2404.16790
• Published
• 10
On the Planning Abilities of Large Language Models -- A Critical
Investigation
Paper
• 2305.15771
• Published
• 1
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of
Multi-modal LLMs in Video Analysis
Paper
• 2405.21075
• Published
• 26
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
Paper
• 2406.09170
• Published
• 27
MuirBench: A Comprehensive Benchmark for Robust Multi-image
Understanding
Paper
• 2406.09411
• Published
• 19
CS-Bench: A Comprehensive Benchmark for Large Language Models towards
Computer Science Mastery
Paper
• 2406.08587
• Published
• 16
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and
Instruction-Tuning Dataset for LVLMs
Paper
• 2406.11833
• Published
• 62
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via
Chart-to-Code Generation
Paper
• 2406.09961
• Published
• 55
Needle In A Multimodal Haystack
Paper
• 2406.07230
• Published
• 54
BABILong: Testing the Limits of LLMs with Long Context
Reasoning-in-a-Haystack
Paper
• 2406.10149
• Published
• 52
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
Paper
• 2407.18961
• Published
• 40
AppWorld: A Controllable World of Apps and People for Benchmarking
Interactive Coding Agents
Paper
• 2407.18901
• Published
• 35
WebArena: A Realistic Web Environment for Building Autonomous Agents
Paper
• 2307.13854
• Published
• 27
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards
General Medical AI
Paper
• 2408.03361
• Published
• 85
SWE-bench-java: A GitHub Issue Resolving Benchmark for Java
Paper
• 2408.14354
• Published
• 41
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated
clinical environments
Paper
• 2405.07960
• Published
• 1
MuSR: Testing the Limits of Chain-of-thought with Multistep Soft
Reasoning
Paper
• 2310.16049
• Published
• 5
MMSearch: Benchmarking the Potential of Large Models as Multi-modal
Search Engines
Paper
• 2409.12959
• Published
• 38
DSBench: How Far Are Data Science Agents to Becoming Data Science
Experts?
Paper
• 2409.07703
• Published
• 66
HelloBench: Evaluating Long Text Generation Capabilities of Large
Language Models
Paper
• 2409.16191
• Published
• 41
OmniBench: Towards The Future of Universal Omni-Language Models
Paper
• 2409.15272
• Published
• 30
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large
Language Models
Paper
• 2410.07985
• Published
• 32
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real
Computer Environments
Paper
• 2404.07972
• Published
• 51
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic
Long-context Multitasks
Paper
• 2412.15204
• Published
• 38
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World
Tasks
Paper
• 2412.14161
• Published
• 51
Are Your LLMs Capable of Stable Reasoning?
Paper
• 2412.13147
• Published
• 93
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity
Visual Descriptions
Paper
• 2412.08737
• Published
• 54
CodeElo: Benchmarking Competition-level Code Generation of LLMs with
Human-comparable Elo Ratings
Paper
• 2501.01257
• Published
• 51
A3: Android Agent Arena for Mobile GUI Agents
Paper
• 2501.01149
• Published
• 22
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on
Self-invoking Code Generation
Paper
• 2412.21199
• Published
• 13
ResearchTown: Simulator of Human Research Community
Paper
• 2412.17767
• Published
• 14
Agent-SafetyBench: Evaluating the Safety of LLM Agents
Paper
• 2412.14470
• Published
• 13
The BrowserGym Ecosystem for Web Agent Research
Paper
• 2412.05467
• Published
• 24
Evaluating Language Models as Synthetic Data Generators
Paper
• 2412.03679
• Published
• 47
Paper
• 2412.04315
• Published
• 19
U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills
in LLMs
Paper
• 2412.03205
• Published
• 19
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding
And A Retrieval-Aware Tuning Framework
Paper
• 2411.06176
• Published
• 45
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for
Evaluating Foundation Models
Paper
• 2411.04075
• Published
• 16
From Medprompt to o1: Exploration of Run-Time Strategies for Medical
Challenge Problems and Beyond
Paper
• 2411.03590
• Published
• 10
URSA: Understanding and Verifying Chain-of-thought Reasoning in
Multimodal Mathematics
Paper
• 2501.04686
• Published
• 53
SOTOPIA: Interactive Evaluation for Social Intelligence in Language
Agents
Paper
• 2310.11667
• Published
• 4
PokerBench: Training Large Language Models to become Professional Poker
Players
Paper
• 2501.08328
• Published
• 19
WebWalker: Benchmarking LLMs in Web Traversal
Paper
• 2501.07572
• Published
• 23
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video
Understanding?
Paper
• 2501.05510
• Published
• 44
Are VLMs Ready for Autonomous Driving? An Empirical Study from the
Reliability, Data, and Metric Perspectives
Paper
• 2501.04003
• Published
• 27
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
Paper
• 2501.08828
• Published
• 30
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
Paper
• 2501.09012
• Published
• 10
Do generative video models learn physical principles from watching
videos?
Paper
• 2501.09038
• Published
• 34
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Paper
• 2501.12380
• Published
• 84
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline
Professional Videos
Paper
• 2501.13826
• Published
• 23
Paper
• 2501.14249
• Published
• 77
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and
Understanding
Paper
• 2501.18362
• Published
• 23
PhysBench: Benchmarking and Enhancing Vision-Language Models for
Physical World Understanding
Paper
• 2501.16411
• Published
• 19
The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of
Physical Concept Understanding
Paper
• 2502.08946
• Published
• 191
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of
Large Language Model
Paper
• 2501.18636
• Published
• 31
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal
Models
Paper
• 2502.00698
• Published
• 24
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
Paper
• 2502.01100
• Published
• 21
The Jumping Reasoning Curve? Tracking the Evolution of Reasoning
Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles
Paper
• 2502.01081
• Published
• 13
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance
Software Engineering?
Paper
• 2502.12115
• Published
• 46
Expect the Unexpected: FailSafe Long Context QA for Finance
Paper
• 2502.06329
• Published
• 133
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
Paper
• 2502.07445
• Published
• 11
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large
Language Models
Paper
• 2502.07346
• Published
• 53
Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance
Paper
• 2502.08127
• Published
• 59
WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation
Paper
• 2502.08047
• Published
• 28
NoLiMa: Long-Context Evaluation Beyond Literal Matching
Paper
• 2502.05167
• Published
• 16
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language
Models for Vision-Driven Embodied Agents
Paper
• 2502.09560
• Published
• 35
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for
Reasoning Quality, Robustness, and Efficiency
Paper
• 2502.09621
• Published
• 28
Logical Reasoning in Large Language Models: A Survey
Paper
• 2502.09100
• Published
• 24
Mathematical Reasoning in Large Language Models: Assessing Logical and
Arithmetic Errors across Wide Numerical Ranges
Paper
• 2502.08680
• Published
• 11
ZeroBench: An Impossible Visual Benchmark for Contemporary Large
Multimodal Models
Paper
• 2502.09696
• Published
• 43
MMTEB: Massive Multilingual Text Embedding Benchmark
Paper
• 2502.13595
• Published
• 45
MLGym: A New Framework and Benchmark for Advancing AI Research Agents
Paper
• 2502.14499
• Published
• 194
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex
Task Automation on PC
Paper
• 2502.14282
• Published
• 29
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
Paper
• 2502.14739
• Published
• 108
Text2World: Benchmarking Large Language Models for Symbolic World Model
Generation
Paper
• 2502.13092
• Published
• 13
IHEval: Evaluating Language Models on Following the Instruction
Hierarchy
Paper
• 2502.08745
• Published
• 20
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem
Understanding
Paper
• 2502.19400
• Published
• 47
Paper
• 2502.19187
• Published
• 10
WebGames: Challenging General-Purpose Web-Browsing AI Agents
Paper
• 2502.18356
• Published
• 14
StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction
Following
Paper
• 2502.14494
• Published
• 15
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for
Multimodal Reasoning Models
Paper
• 2502.16033
• Published
• 18
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language
Models
Paper
• 2502.16614
• Published
• 27
Can Language Models Falsify? Evaluating Algorithmic Reasoning with
Counterexample Creation
Paper
• 2502.19414
• Published
• 20
CODESYNC: Synchronizing Large Language Models with Dynamic Code
Evolution at Scale
Paper
• 2502.16645
• Published
• 21
DeepSolution: Boosting Complex Engineering Solution Design via
Tree-based Exploration and Bi-point Thinking
Paper
• 2502.20730
• Published
• 38
LINGOLY-TOO: Disentangling Memorisation from Reasoning with Linguistic
Templatisation and Orthographic Obfuscation
Paper
• 2503.02972
• Published
• 25
MultiAgentBench: Evaluating the Collaboration and Competition of LLM
agents
Paper
• 2503.01935
• Published
• 30
SafeArena: Evaluating the Safety of Autonomous Web Agents
Paper
• 2503.04957
• Published
• 21
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation
for Feature Implementation
Paper
• 2503.06680
• Published
• 20
WritingBench: A Comprehensive Benchmark for Generative Writing
Paper
• 2503.05244
• Published
• 22
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for
Complex Medical Reasoning
Paper
• 2503.07459
• Published
• 16
Benchmarking AI Models in Software Engineering: A Review, Search Tool,
and Enhancement Protocol
Paper
• 2503.05860
• Published
• 11
VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large
Vision-Language Models in Fact-Seeking Question Answering
Paper
• 2503.06492
• Published
• 11
WildIFEval: Instruction Following in the Wild
Paper
• 2503.06573
• Published
• 14
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial
Samples
Paper
• 2410.14669
• Published
• 39
VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain
Knowledge
Paper
• 2504.10342
• Published
• 11
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Paper
• 2310.06770
• Published
• 9
Survey on Evaluation of LLM-based Agents
Paper
• 2503.16416
• Published
• 96
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
Paper
• 2503.14478
• Published
• 48
SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?
Paper
• 2503.12349
• Published
• 44
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs
for Knowledge-Intensive Visual Grounding
Paper
• 2503.12797
• Published
• 32
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the
LLM Era
Paper
• 2503.12329
• Published
• 27
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based
Scientific Research
Paper
• 2503.13399
• Published
• 22
VERIFY: A Benchmark of Visual Explanation and Reasoning for
Investigating Multimodal Reasoning Fidelity
Paper
• 2503.11557
• Published
• 22
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
Paper
• 2503.11495
• Published
• 14
SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning
Tasks
Paper
• 2503.15478
• Published
• 13
Measuring AI Ability to Complete Long Tasks
Paper
• 2503.14499
• Published
• 16
MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process
Errors Identification
Paper
• 2503.12505
• Published
• 11
BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space
Complexity?
Paper
• 2503.15242
• Published
• 10
Challenging the Boundaries of Reasoning: An Olympiad-Level Math
Benchmark for Large Language Models
Paper
• 2503.21380
• Published
• 38
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
Paper
• 2503.19990
• Published
• 35
Exploring Hallucination of Large Multimodal Models in Video
Understanding: Benchmark, Analysis and Mitigation
Paper
• 2503.19622
• Published
• 31
ResearchBench: Benchmarking LLMs in Scientific Discovery via
Inspiration-Based Task Decomposition
Paper
• 2503.21248
• Published
• 21
Video SimpleQA: Towards Factuality Evaluation in Large Video Language
Models
Paper
• 2503.18923
• Published
• 14
Can Large Vision Language Models Read Maps Like a Human?
Paper
• 2503.14607
• Published
• 10
Exploring the Effect of Reinforcement Learning on Video Understanding:
Insights from SEED-Bench-R1
Paper
• 2503.24376
• Published
• 38
PaperBench: Evaluating AI's Ability to Replicate AI Research
Paper
• 2504.01848
• Published
• 37
CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive
Program Synthesis
Paper
• 2503.23145
• Published
• 35
How Many Instructions Can LLMs Follow at Once?
Paper
• 2507.11538
• Published
• 2
YourBench: Easy Custom Evaluation Sets for Everyone
Paper
• 2504.01833
• Published
• 22
PHYSICS: Benchmarking Foundation Models on University-Level Physics
Problem Solving
Paper
• 2503.21821
• Published
• 21
VCR-Bench: A Comprehensive Evaluation Framework for Video
Chain-of-Thought Reasoning
Paper
• 2504.07956
• Published
• 46
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
Paper
• 2504.02605
• Published
• 48
ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning
Paper
• 2503.22738
• Published
• 17
Towards Visual Text Grounding of Multimodal Large Language Model
Paper
• 2504.04974
• Published
• 17
MME-Unify: A Comprehensive Benchmark for Unified Multimodal
Understanding and Generation Models
Paper
• 2504.03641
• Published
• 14
Generative Evaluation of Complex Reasoning in Large Language Models
Paper
• 2504.02810
• Published
• 14
V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric
Capabilities in Multimodal Large Language Models
Paper
• 2504.06148
• Published
• 13
ColorBench: Can VLMs See and Understand the Colorful World? A
Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
Paper
• 2504.10514
• Published
• 48
Paper
• 2504.11442
• Published
• 30
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent
Trajectories
Paper
• 2504.08942
• Published
• 28
ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question
Answering
Paper
• 2504.05506
• Published
• 25
S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability
of Large Reasoning Models
Paper
• 2504.10368
• Published
• 22
MIEB: Massive Image Embedding Benchmark
Paper
• 2504.10471
• Published
• 21
MLRC-Bench: Can Language Agents Solve Machine Learning Research
Challenges?
Paper
• 2504.09702
• Published
• 18
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal
Large Language Models
Paper
• 2504.15279
• Published
• 78
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
Paper
• 2504.15521
• Published
• 64
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in
Large Language Models
Paper
• 2504.16074
• Published
• 36
Seeing from Another Perspective: Evaluating Multi-View Understanding in
MLLMs
Paper
• 2504.15280
• Published
• 25
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning
in Multimodal LLMs
Paper
• 2504.15415
• Published
• 23