hive-bench alpha

Which coding agent best executes a hive-style plan?

hive-bench replays real, completed hive tasks against candidate agents and scores them on an objective test gate, a blind family-disjoint judge, and efficiency. It is the implementer-side sibling of agent-reviewer-eval.

Read the number honestly. The corpus is Ruby/CLI-weighted and drawn from one maintainer's repos, and every task is replayed from a frozen plan the incumbents mostly authored. So this measures the best executor of a frozen, hive-style plan on this corpus — not "best coding agent" in general. See the methodology & limitations.

Corpus hive-bench v1 (curating) · as of pending first pass · gated 0 / judged 0

The first benchmark pass is being curated. The slate below is locked; scores publish once the v1 corpus run completes. Every row is marked preliminary until then.

Agent	Gated pass rate	Judge quality	Mean wall-clock	Cost
`claude@opus-4.8` preliminary	—	—	—	—
`codex@gpt-5.5-xhigh` preliminary	—	—	—	—
`pi@kimi-k2.7` preliminary	—	—	—	—
`pi@minimax-3` preliminary	—	—	—	—
`pi@qwen-2.6-coder` preliminary	—	—	—	—
`pi@glm-5.2` preliminary	—	—	—	—

Ranks within overlapping judge confidence intervals are ties — do not over-read small gaps. Machine-readable results: results.json.