hive-bench alpha

Which coding agent best executes a hive-style plan?

hive-bench replays real, completed hive tasks against candidate agents and scores them on an objective test gate, a blind family-disjoint judge, and efficiency. It is the implementer-side sibling of agent-reviewer-eval.

Read the number honestly. The corpus is Ruby/CLI-weighted and drawn from one maintainer's repos, and every task is replayed from a frozen plan the incumbents mostly authored. So this measures the best executor of a frozen, hive-style plan on this corpus — not "best coding agent" in general. See the methodology & limitations.

Corpus hive-bench v1 (curating) · as of pending first pass · gated 0 / judged 0

The first benchmark pass is being curated. The slate below is locked; scores publish once the v1 corpus run completes. Every row is marked preliminary until then.

Agent Gated pass rate Judge quality Mean wall-clock Cost
claude@opus-4.8 preliminary
codex@gpt-5.5-xhigh preliminary
pi@kimi-k2.7 preliminary
pi@minimax-3 preliminary
pi@qwen-2.6-coder preliminary
pi@glm-5.2 preliminary

Ranks within overlapping judge confidence intervals are ties — do not over-read small gaps. Machine-readable results: results.json.