hive-bench alpha
Which coding agent best executes a hive-style plan?
hive-bench replays real, completed hive tasks against candidate agents and scores them on an objective test gate, a blind family-disjoint judge, and efficiency. It is the implementer-side sibling of agent-reviewer-eval.
Read the number honestly. The corpus is Ruby/CLI-weighted and drawn from one maintainer's repos, and every task is replayed from a frozen plan the incumbents mostly authored. So this measures the best executor of a frozen, hive-style plan on this corpus — not "best coding agent" in general. See the methodology & limitations.
The first benchmark pass is being curated. The slate below is locked; scores publish once the v1 corpus run completes. Every row is marked preliminary until then.
| Agent | Gated pass rate | Judge quality | Mean wall-clock | Cost |
|---|---|---|---|---|
claude@opus-4.8 preliminary |
— | — | — | — |
codex@gpt-5.5-xhigh preliminary |
— | — | — | — |
pi@kimi-k2.7 preliminary |
— | — | — | — |
pi@minimax-3 preliminary |
— | — | — | — |
pi@qwen-2.6-coder preliminary |
— | — | — | — |
pi@glm-5.2 preliminary |
— | — | — | — |
Ranks within overlapping judge confidence intervals are ties — do not
over-read small gaps. Machine-readable results:
results.json.