πŸ† ClawBench β€” Web Agent Benchmark

Can AI agents complete everyday online tasks? ClawBench scores agents on real, live websites (booking flights, ordering groceries, submitting job applications). There are two corpora: V1 with 153 tasks across 144 websites, and V2 with 130 newer tasks across 63 platforms. Every run is graded twice: first by a deterministic HTTP-request interception check (Stage 1, the sort key), then by an LLM judge on the intercepted payload (Stage 2 = Reward).
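The two-stage grading described above can be sketched roughly as follows. This is a minimal illustration, not ClawBench's actual implementation: the names `InterceptedRequest`, `TaskSchema`, `stage1_intercepted`, and `grade` are hypothetical, the schema is assumed to be an HTTP method plus a URL regex, and the LLM judge is stood in for by any callable that returns a boolean.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class InterceptedRequest:
    """The agent's final HTTP request as captured by the harness (hypothetical shape)."""
    method: str
    url: str
    body: str


@dataclass
class TaskSchema:
    """Assumed per-task schema: an HTTP method and a regex the URL must fully match."""
    method: str
    url_pattern: str


def stage1_intercepted(req: Optional[InterceptedRequest], schema: TaskSchema) -> bool:
    """Stage 1: deterministic check, no judge involved."""
    if req is None:  # the agent never issued a final request
        return False
    same_method = req.method.upper() == schema.method.upper()
    url_matches = re.fullmatch(schema.url_pattern, req.url) is not None
    return same_method and url_matches


def grade(
    req: Optional[InterceptedRequest],
    schema: TaskSchema,
    judge: Callable[[str], bool],
) -> dict:
    """Stage 2 (Reward) is only consulted when Stage 1 has passed."""
    intercepted = stage1_intercepted(req, schema)
    reward = bool(intercepted and judge(req.body))
    return {"intercepted": intercepted, "reward": reward}


# Example with a placeholder site and a trivial stand-in for the LLM judge.
schema = TaskSchema(method="POST", url_pattern=r"https://shop\.example/api/orders")
request = InterceptedRequest("POST", "https://shop.example/api/orders", '{"item": "milk"}')
result = grade(request, schema, judge=lambda body: "milk" in body)
```

Because Reward requires Stage 1 to pass first, a run's Reward rate can never exceed its Intercepted rate under this sketch.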

📖 Paper · 💻 GitHub · 🗂 Dataset · 🎞 Traces V1 · 🎞 Traces V2 · 🌐 Site

Intercepted (sort key): the agent's final HTTP request matched the task's URL/method schema (Stage 1; deterministic, no judge). Reward: additionally requires the LLM judge (default deepseek/deepseek-v4-pro) to confirm that the payload fulfilled the instruction (Stage 2). Rows are ranked by Intercepted descending, with Reward descending as the tiebreak. V2 is Hermes-only; alternative harnesses are evaluated separately. Partial: the batch attempted fewer tasks than the full corpus (mid-run abort or queue cap); rates are computed over attempted tasks, not over the full corpus.
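As a rough illustration of the ranking rule and of how rates are computed for partial batches, here is a small sketch. The `Row` class, its field names, and the numbers in the example are all hypothetical and are not taken from the leaderboard data; the only things assumed from the description are that rates divide by attempted tasks and that rows sort by Intercepted, then Reward, both descending.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Row:
    """One leaderboard entry (hypothetical field names)."""
    model: str
    intercepted: int   # runs that passed Stage 1
    rewarded: int      # runs that also passed Stage 2
    attempted: int     # tasks the batch actually attempted
    corpus_size: int   # tasks in the full corpus

    @property
    def intercepted_rate(self) -> float:
        # Rates are over attempted tasks, not over the full corpus.
        return self.intercepted / self.attempted if self.attempted else 0.0

    @property
    def reward_rate(self) -> float:
        return self.rewarded / self.attempted if self.attempted else 0.0

    @property
    def partial(self) -> bool:
        # A batch is partial when it attempted fewer tasks than the corpus holds.
        return self.attempted < self.corpus_size


def rank(rows: List[Row]) -> List[Row]:
    """Sort by Intercepted descending, then Reward descending as the tiebreak."""
    return sorted(rows, key=lambda r: (-r.intercepted_rate, -r.reward_rate))


# Made-up rows, for illustration only.
rows = [
    Row("model-a", intercepted=50, rewarded=10, attempted=100, corpus_size=130),
    Row("model-b", intercepted=50, rewarded=20, attempted=100, corpus_size=130),
    Row("model-c", intercepted=78, rewarded=5, attempted=130, corpus_size=130),
]
ranked = rank(rows)
```

One consequence of dividing by attempted tasks is that percentages on a partial row are not directly comparable to a count divided by the corpus size shown in the same row.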

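| Rank | Model | Harness | Corpus | Intercepted | Reward | Pass | Total | Wall (h) |
|------|-------|---------|--------|-------------|--------|------|-------|----------|
| 1 | openrouter-owl-alpha | hermes | v2 | 54.67% | 13.33% | 10 | 130 | - |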