Exploration and Exploitation Errors Are Measurable for Language Model Agents
SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks