arxiv:2603.09652

MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants

Published on Mar 10 · Submitted by taesiri on Mar 11
Abstract

MiniAppBench introduces the first comprehensive benchmark for evaluating principle-driven, interactive application generation, addressing the gap in existing benchmarks that focus on static correctness rather than dynamic, real-world interactions.

AI-generated summary

With the rapid advancement of Large Language Models (LLMs) in code generation, human-AI interaction is evolving from static text responses to dynamic, interactive HTML-based applications, which we term MiniApps. These applications require models to not only render visual interfaces but also construct customized interaction logic that adheres to real-world principles. However, existing benchmarks primarily focus on algorithmic correctness or static layout reconstruction, failing to capture the capabilities required for this new paradigm. To address this gap, we introduce MiniAppBench, the first comprehensive benchmark designed to evaluate principle-driven, interactive application generation. Sourced from a real-world application with 10M+ generations, MiniAppBench distills 500 tasks across six domains (e.g., Games, Science, and Tools). Furthermore, to tackle the challenge of evaluating open-ended interactions where no single ground truth exists, we propose MiniAppEval, an agentic evaluation framework. Leveraging browser automation, it performs human-like exploratory testing to systematically assess applications across three dimensions: Intention, Static, and Dynamic. Our experiments reveal that current LLMs still face significant challenges in generating high-quality MiniApps, while MiniAppEval demonstrates high alignment with human judgment, establishing a reliable standard for future research. Our code is available at github.com/MiniAppBench.

Community

Hi everyone! We are excited to introduce MiniAppBench, the first comprehensive benchmark designed to evaluate principle-driven, interactive application generation by LLMs. 🚀

While traditional benchmarks focus on static layouts or algorithmic snippets, we shift the paradigm toward MiniApps: evaluating whether models can generate HTML-based applications requiring both visual rendering and complex interaction logic (e.g., physics simulators, interactive games).

πŸ† The Shocking Reality: GLM-5 Dethrones GPT-5.4

We evaluated 20 top models using our agentic framework. The results on complex interactive tasks are eye-opening:

  • 🥇 GLM-5 (61.80%) narrowly beats 🥈 Claude-Opus-4.6 (61.60%).
  • 🥉 GPT-5.4 (56.60%) shows a severe "difficulty cliff": while it dominates Easy tasks (82.31%), it crashes to 35.03% on Hard tasks (requiring complex state transitions). Meanwhile, GLM-5 and Claude remain robust at ~45% on Hard tasks.
  • The gap between static coding ability and interactive application generation is massive.

(Feel free to check our interactive Leaderboard below for the full breakdown!)

🌍 Why MiniAppBench?

  • Real-World Scale: Distilled from 10M+ in-the-wild human-AI interaction traces.
  • Agentic Evaluation (MiniAppEval): We don't just string-match code. Our framework uses a browser-automation Agent to click, drag, and test the live generated apps, capturing DOM states and sequential logic (Pearson r > 0.85 with human judges).
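To make the three-dimension scoring concrete, here is a minimal sketch of aggregating per-dimension results (Intention, Static, Dynamic) into one overall score. The equal weighting and the `overall_score` helper are illustrative assumptions, not the actual MiniAppEval aggregation described in the paper.

```python
# Hypothetical sketch: combining MiniAppEval's three evaluation dimensions
# (Intention, Static, Dynamic) into a single overall score.
# Equal weights are an assumption for illustration only.

DIMENSIONS = ("intention", "static", "dynamic")

def overall_score(scores: dict) -> float:
    """Average the per-dimension scores (each assumed to be in [0, 100])."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

# Example: a model that nails the user's intent and the static layout,
# but fails on dynamic interaction logic.
print(overall_score({"intention": 90.0, "static": 75.0, "dynamic": 30.0}))  # prints 65.0
```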

⚡️ Zero Integration Cost: Test Your Model in 5 Mins

We know configuring evaluation harnesses is a pain. That's why we open-sourced the entire end-to-end scaffolding.
Just bring your OpenAI-compatible API Key. No extra parsing scripts needed.

# 1. Clone & Install
git clone https://github.com/MiniAppBench/miniappbench.git
cd miniappbench && pip install -r requirements.txt
playwright install chromium

# 2. Run the full pipeline (Generation -> Agentic Evaluation)
python -m examples.pipeline --query-file data/query_validation_100.json --batch "1-5"
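For intuition on what a "Static" probe of a generated app might look at before any browser automation runs, here is a stdlib-only sketch that counts interactive elements in a generated MiniApp's HTML. MiniAppEval itself drives a real browser via Playwright; the `count_interactive` helper and the tag list below are purely illustrative assumptions.

```python
# Hypothetical sketch of a "Static" dimension probe: parse a generated
# MiniApp and count its interactive elements. The real MiniAppEval
# framework performs exploratory testing in a live browser instead.
from html.parser import HTMLParser

INTERACTIVE_TAGS = {"button", "input", "select", "textarea", "canvas"}

class InteractiveCounter(HTMLParser):
    """Counts start tags that typically carry user interaction."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE_TAGS:
            self.count += 1

def count_interactive(html: str) -> int:
    parser = InteractiveCounter()
    parser.feed(html)
    return parser.count

app = "<html><body><button>Start</button><canvas id='game'></canvas></body></html>"
print(count_interactive(app))  # prints 2
```

A real static check would go further (layout, styling, accessibility), but even this crude count separates a text-only response from an app with clickable surface area.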

🔗 Links

We'd love to hear your thoughts, especially on the performance divergence between models on "Hard" interactive tasks!

