ClawBench Leaderboard

Intercepted (sort key): the agent's final HTTP request matched the task's URL/method schema. This is Stage 1: deterministic, no judge.
Reward: additionally requires the LLM judge (default deepseek/deepseek-v4-pro) to confirm that the payload fulfilled the instruction. This is Stage 2.
Rows are ranked by Intercepted descending, with Reward descending as the tiebreak.
V2 is Hermes-only; alternative harnesses are evaluated separately.
Partial: the batch attempted fewer tasks than the full corpus (mid-run abort or queue cap); rates are computed over attempted tasks, not over the corpus.

Rank	Model	Harness	Corpus	Intercepted	Reward	Pass	Total	Wall (h)
1	claude-opus-4-7	hermes	v2	54.67%	13.33%	10	75	—
2	glm-5.1	hermes	v2	48.46%	18.46%	24	130	—
3	gpt-5.5	hermes	v2	48.15%	11.11%	9	81	—
4	deepseek-v4-pro	hermes	v2	43.85%	10.00%	13	130	—
5	openrouter-owl-alpha	hermes	v2	14.62%	4.62%	6	130	—
6	deepseek-v4-flash	hermes	v2	3.08%	1.54%	2	130	—
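
The ranking rule can be sketched in a few lines of Python. The rows are copied from the ranked table above; the `Row` tuple, the `rank` and `is_partial` helpers, and the use of Total < 130 to flag a partial V2 batch are illustrative assumptions, not the leaderboard's actual code.

```python
from typing import NamedTuple

V2_CORPUS_SIZE = 130  # V2 task count stated on this page


class Row(NamedTuple):
    model: str
    intercepted_pct: float
    reward_pct: float
    passed: int
    total: int  # tasks attempted in the batch


# Deliberately unordered, to show that the sort reproduces the table.
ROWS = [
    Row("deepseek-v4-flash", 3.08, 1.54, 2, 130),
    Row("gpt-5.5", 48.15, 11.11, 9, 81),
    Row("glm-5.1", 48.46, 18.46, 24, 130),
    Row("openrouter-owl-alpha", 14.62, 4.62, 6, 130),
    Row("claude-opus-4-7", 54.67, 13.33, 10, 75),
    Row("deepseek-v4-pro", 43.85, 10.00, 13, 130),
]


def rank(rows):
    """Sort by Intercepted descending, then Reward descending as tiebreak."""
    return sorted(rows, key=lambda r: (-r.intercepted_pct, -r.reward_pct))


def is_partial(row, corpus_size=V2_CORPUS_SIZE):
    """Assumed definition: fewer attempted tasks than the corpus holds."""
    return row.total < corpus_size


for position, row in enumerate(rank(ROWS), start=1):
    # Rates are over attempted tasks, so passed / total gives the Reward column.
    reward_over_attempted = round(100 * row.passed / row.total, 2)
    flag = " (partial)" if is_partial(row) else ""
    print(f"{position}. {row.model}: reward {reward_over_attempted:.2f}%{flag}")
```

Dividing Pass by Total reproduces the Reward column for every row, which is consistent with rates being computed over attempted tasks rather than over the full corpus.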

About ClawBench

Why a new benchmark?

Existing browser-agent benchmarks either run on synthetic / sandboxed websites (WebArena, VisualWebArena) or only check whether the agent reached the endpoint (WebVoyager). ClawBench runs on live, real-world websites and verifies the payload the agent submitted — so an agent that types the wrong delivery address into Uber Eats fails, even if its last HTTP request hit the correct endpoint.

Two corpora

V1: 153 tasks across 144 real websites (the corpus described in the paper).
V2: 130 newer everyday tasks across 63 platforms, with expanded coverage of e-commerce, form-filling, and authentication-walled flows.

Two-stage scoring

Stage	What it checks	Output
1. Interception	Did the final HTTP request match the task's URL + method + canonical body schema?	`intercepted ∈ {true, false}`
2. Judge	Given the natural-language instruction and the intercepted payload, did the agent submit the right thing?	`match ∈ {true, false, null}`
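
As a rough illustration of what a deterministic Stage 1 check could look like, the sketch below matches a request against a URL pattern, an HTTP method, and a set of required body keys. The `TaskSchema` shape, the `is_intercepted` function, and the example.com endpoint are all hypothetical; the real task schema format is not shown on this page.

```python
import re
from dataclasses import dataclass
from typing import Any, FrozenSet, Mapping


@dataclass(frozen=True)
class TaskSchema:
    # Hypothetical schema shape, chosen only for this sketch.
    url_pattern: str                # regular expression for the target URL
    method: str                     # expected HTTP method, e.g. "POST"
    required_body_keys: FrozenSet[str]


def is_intercepted(schema: TaskSchema, url: str, method: str,
                   body: Mapping[str, Any]) -> bool:
    """Deterministic check: URL, method, and body keys must all match."""
    if re.fullmatch(schema.url_pattern, url) is None:
        return False
    if method.upper() != schema.method.upper():
        return False
    # Every required key must be present; extra keys are tolerated here.
    return schema.required_body_keys <= set(body)


# Made-up endpoint and payload, for illustration only.
order_schema = TaskSchema(
    url_pattern=r"https://example\.com/api/orders/\d+",
    method="POST",
    required_body_keys=frozenset({"address", "items"}),
)

hit = is_intercepted(
    order_schema,
    "https://example.com/api/orders/42",
    "post",
    {"address": "1 Main St", "items": ["soup"], "note": "ring bell"},
)
miss = is_intercepted(
    order_schema,
    "https://example.com/api/orders/42",
    "GET",
    {"address": "1 Main St", "items": ["soup"]},
)
```

Note that this check says nothing about whether the address is the right one; that is the judge's job in Stage 2.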

Reward = Intercepted ∧ Match. For the full prompt and judge model details, see eval/scoring.md.
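
The reward rule can be written as a small function. This is a minimal sketch: the page does not say how a `null` judge verdict is treated, so counting it as no reward is an assumption, and the `reward` and `reward_rate` names are illustrative.

```python
from typing import Iterable, Optional, Tuple


def reward(intercepted: bool, match: Optional[bool]) -> bool:
    """Reward = Intercepted AND Match.

    Assumption: match=None (no verdict from the judge) earns no reward.
    """
    return intercepted and match is True


def reward_rate(results: Iterable[Tuple[bool, Optional[bool]]]) -> float:
    """Fraction of attempted tasks that earned reward."""
    results = list(results)
    if not results:
        return 0.0
    return sum(reward(i, m) for i, m in results) / len(results)


# Four attempted tasks as (intercepted, match) pairs.
attempted = [
    (True, True),    # right endpoint, right payload: rewarded
    (True, False),   # right endpoint, wrong payload: not rewarded
    (False, None),   # never reached the endpoint: not rewarded
    (True, None),    # intercepted, but no judge verdict: not rewarded here
]
rate = reward_rate(attempted)  # 1 rewarded out of 4 attempted
```

Because reward requires interception, the Reward rate can never exceed the Intercepted rate for the same batch.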

What ships with every run

A 5-layer trace bundle (downloadable from the Traces datasets above):

recording.mp4 — full browser session video
actions.jsonl — every click / type / scroll
agent-messages.jsonl — model inputs & outputs (incl. reasoning)
requests.jsonl — every HTTP request the page made
interception.json — graded final request
run-meta.json — model, harness, scores, timing
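
Before uploading or auditing a run, it can help to check that a bundle contains every file listed above. The sketch below does that and reads a JSON Lines file; the `missing_files` and `read_jsonl` helpers and the `"type"` field in the demo records are assumptions for illustration, since the record schemas are not documented here.

```python
import json
import tempfile
from pathlib import Path

# File names as listed above.
EXPECTED_FILES = (
    "recording.mp4",
    "actions.jsonl",
    "agent-messages.jsonl",
    "requests.jsonl",
    "interception.json",
    "run-meta.json",
)


def missing_files(bundle_dir):
    """Return the expected file names that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def read_jsonl(path):
    """Parse a JSON Lines file into a list, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Demo on a throwaway directory that holds only a fake actions.jsonl.
with tempfile.TemporaryDirectory() as tmp:
    actions_path = Path(tmp) / "actions.jsonl"
    # The "type" field is made up; real records may look different.
    actions_path.write_text('{"type": "click"}\n{"type": "scroll"}\n', encoding="utf-8")
    demo_missing = missing_files(tmp)
    demo_actions = read_jsonl(actions_path)

print(demo_missing)
print(len(demo_actions))
```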

Reproducing

pip install clawbench-eval
clawbench run --model <your-model> --harness hermes --corpus v2
python scripts/clawbench_rescore.py --judge-model deepseek-v4-pro --only-batch <your-batch-dir>

🚀 Submit your model

Submissions are accepted as PRs to the leaderboard CSV in the dataset repo.

Required steps

Run the benchmark: install with pip install clawbench-eval, then run clawbench run --model <your-model> --harness hermes --corpus v2 (or v1). Use one of the included harnesses (hermes / openclaw) so that traces follow the standard 5-layer format.
Score: python scripts/clawbench_rescore.py --judge-model deepseek-v4-pro --only-batch <your-batch-dir> produces rescore-summary.json with the values you'll need for your leaderboard row.
Upload traces (recommended): push the 5-layer run bundles to TIGER-Lab/ClawBenchV2Trace (or NAIL-Group/ClawBenchV1Trace) so others can audit them.
Open a PR: add one row per (model, harness, corpus) to leaderboard/results.csv, using the columns model,harness,dataset,passed,total,pass_rate,reward_rate,wall_hours. Link the trace bundle in the PR description.
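
To avoid column-order mistakes in the PR, the CSV row can be generated rather than typed by hand. This is a sketch under assumptions: the `format_row` helper is hypothetical, the values are placeholders rather than real results, and the expected number format for the rate columns is not specified on this page, so copy the actual values from your rescore-summary.json.

```python
import csv
import io

# Column order as given in the submission instructions.
COLUMNS = [
    "model", "harness", "dataset", "passed",
    "total", "pass_rate", "reward_rate", "wall_hours",
]


def format_row(values):
    """Return one CSV line with the columns in leaderboard order."""
    missing = [c for c in COLUMNS if c not in values]
    extra = [k for k in values if k not in COLUMNS]
    if missing or extra:
        raise ValueError(f"missing columns {missing}, unexpected columns {extra}")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writerow(values)
    return buffer.getvalue()


# Placeholder values only; not a real result.
example_line = format_row({
    "model": "my-model",
    "harness": "hermes",
    "dataset": "v2",
    "passed": 0,
    "total": 130,
    "pass_rate": 0.0,
    "reward_rate": 0.0,
    "wall_hours": 0.0,
})
print(example_line, end="")
```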

Before merging, we re-run our judge on a sample of your submitted traces to keep the table honest.

For step-by-step instructions, see eval/scoring.md.

🏆 ClawBench — Web Agent Benchmark