Methodology & limitations

What one cell measures

One cell is one corpus task run by one candidate configuration. The published board has six tasks and eleven candidates across three completed campaigns, for 66 generated cells. A candidate can use one model across the workflow or split stages between models; for example, sol-plan->grok-exec-sol-review uses Sol to plan and review and Grok to execute.

Each task is a real completed Hive task with a merged reference PR. The runner rewinds a source clone to the task’s base commit and supplies the task’s frozen candidate-visible inputs; the candidate never receives the reference patch. The candidate then runs the real Hive cycle in an isolated runner: planning, implementation, a sandbox-local pull request, and Hive’s production review and fix loop. The scored artifact is the final captured candidate diff. When review finalization failed, the harness restored the saved post-execute diff rather than scoring partial review side effects.

Workflow and reviewer configuration

Compound Engineering powered planning through /ce-plan. Implementation then used Hive’s normal plan-driven development stage; it did not invoke /ce-work. Hive opened a benchmark-local pull request and ran its production review, courageous triage, and fix loop for at most two passes and two hours. No review CI command or browser test was configured for these campaigns.

The production reviewers below were part of the candidate workflow and could change the diff before scoring. They are separate from the Fable and Sol judges, which only evaluated the finished artifact afterward.

Candidate	Stage owners	Pre-score reviewer panel
GPT-5.6 Sol xhigh	Plan, implement, PR, triage, fix: Sol xhigh	`codex-ce-code-review` using Sol xhigh
Sol plan → Sol execute → Sol + Grok review	Plan: Sol xhigh; Implement: Sol high; PR, triage, fix: Sol xhigh	`codex-ce-code-review` using Sol xhigh and `grok-ce-code-review` using Grok 4.5 xhigh
Sol plan → Grok execute → Sol review	Plan, PR, triage, fix: Sol xhigh; Implement: Grok 4.5 xhigh	`codex-ce-code-review` using Sol xhigh
Sol plan → Terra execute → Sol review	Plan, PR, triage, fix: Sol xhigh; Implement: Terra xhigh	`codex-ce-code-review` using Sol xhigh
Sol plan → Terra execute → Grok review	Plan: Sol xhigh; Implement: Terra xhigh; PR, triage, fix: Grok 4.5 xhigh	`grok-ce-code-review` using Grok 4.5 xhigh
Fable plan → Grok execute → Sol review	Plan: Fable 5 high; Implement: Grok 4.5 xhigh; PR, triage, fix: Sol xhigh	`codex-ce-code-review` using Sol xhigh
Opus plan → Codex 5.5 xhigh	Plan, PR, triage, fix: Opus 4.8; Implement: Codex 5.5 xhigh	`claude-ce-code-review` using Opus; `codex-ce-code-review` using Codex; `pr-review-toolkit` using Opus
Opus 4.8	Plan, implement, PR, triage, fix: Opus 4.8	`claude-ce-code-review` and `pr-review-toolkit`, both using Opus
Grok 4.5 xhigh	Plan, implement, PR, triage, fix: Grok 4.5 xhigh	`grok-ce-code-review` using Grok's embedded CE review template
Codex 5.5 xhigh	Plan, implement, PR, triage, fix: Codex 5.5 xhigh	`codex-ce-code-review` using Codex 5.5 xhigh
GLM 5.2	Plan, implement, PR, triage, fix: GLM 5.2 through Pi	`pi-ce-code-review` using GLM 5.2 through Pi

The original publication commit’s harness defines the earlier candidate stage assignments and review-panel derivation. The two production-panel rows (Sol plan → Sol execute → Sol + Grok review and Sol plan → Terra execute → Grok review) come from a later pre-registered campaign. The evidence bundles do not bind every result to a harness revision or contain each resolved config.yml, so the table documents the publication records, not independently serialized per-cell configuration provenance.

Current scoring

Two judges independently score every final diff from 0–10 against the task and merged reference:

Fable 5, pinned to xhigh in all three campaigns.
GPT-5.6 Sol, pinned to xhigh for the original campaign and ultra for both three-seed follow-ups.

The merged PR tells a judge what the task ultimately required; candidates are not rewarded for textual or structural similarity to it. The original 36 cells have one score sample per judge; the 30 cells in the two later campaigns have three samples per judge. The bold independent score for the original rows and the independent three-sample mean for the later rows remain the leaderboard’s primary evidence. The public summary table opens ordered by the paired discussion-final mean, while the independent scores remain visible and sortable.

All 66 cells then received an adversarial deliberation pass. The resulting discussion final is a secondary, separate one-shot diagnostic run with campaign-specific round-one provenance: the original campaign reused its exact published independent verdicts and rationales, recovered from exact local provider sessions, while both three-seed follow-ups freshly re-graded round one. In all three campaigns, each judge then received the other judge’s verdict anonymously, argued the strongest evidence-based case that its own view was wrong, and held or revised. The discussion final is not an adjustment applied to the independent score or mean.
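The sampling rule for independent scores can be sketched as follows. This is a minimal illustration, not the harness’s code: the function name `independent_score` and the sample values are hypothetical.

```python
from statistics import fmean


def independent_score(samples):
    """Return one judge's independent score for a single cell.

    A single sample (original campaign) is reported as-is; three samples
    (three-seed follow-ups) are reported as their arithmetic mean.
    """
    if not samples:
        raise ValueError("a scored cell needs at least one sample")
    if any(not 0 <= s <= 10 for s in samples):
        raise ValueError("judge scores are on a 0-10 scale")
    return fmean(samples)


# Hypothetical values for illustration, not published results.
original_cell = independent_score([7])         # one sample per judge
follow_up_cell = independent_score([6, 7, 8])  # three samples per judge
```

Discussion-final decisions would be stored separately under this sketch; they never overwrite the value returned here.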

The site publishes all 132 discussion-final judge decisions across all 66 cells. Reusing the original campaign’s published initial verdicts means its round two did not rerun independent scoring. Sol’s originally interrupted second-round decision for the Fable-plan/Grok-execute daemon follow-up cell was also recovered by replaying only round two from the preserved pair of round-one verdicts, plan, diff, and reference. Discussion finals do not replace the independent leaderboard because exposing verdicts can add anchoring or convergence pressure even when it also surfaces genuine misses. The default After discussion sort orders all eleven rows by their paired final mean; choosing it as the presentation default does not rewrite the independent scores. The per-task board displays Fable before Sol for both layers.

Judge calibrations are not interchangeable. The site keeps separate Fable and Sol columns as the primary evidence and uses their arithmetic mean only as a presentation aid to sort one compact leaderboard. It is not a third judge or a claim that the two rulers share a scale. A score is marked same-family when that judge shares a model family with any model in the candidate configuration. Those scores remain visible but should be treated as weaker evidence because self-preference cannot be ruled out.
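The same-family flag and the sorting mean might be expressed like this. It is a sketch under assumptions: the helper names are hypothetical, and treating each model name as its own family label is a simplification rather than the site’s actual family mapping.

```python
def is_same_family(judge_family, candidate_families):
    """True when the judge shares a family with any model in the configuration."""
    return judge_family in candidate_families


def paired_mean(fable_score, sol_score):
    """Arithmetic mean of the two judges, used only to sort the compact board.

    This is a presentation aid, not a third judge: the two inputs stay
    published as separate columns.
    """
    return (fable_score + sol_score) / 2


# Hypothetical configuration in which Sol plans and reviews and Grok executes.
candidate_families = {"sol", "grok"}
sol_flagged = is_same_family("sol", candidate_families)      # Sol judge overlaps
fable_flagged = is_same_family("fable", candidate_families)  # Fable judge does not
```

A flagged score would still be displayed; the flag only tells the reader to weigh it as weaker evidence.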

GPT-5.5 Pro is a historical supplemental ruler. It scored only one cell each for Codex 5.5 xhigh, GLM 5.2, and Grok 4.5. Those three observations remain in the published score data but are not used in the compact leaderboard ranking.

Coverage and objective evidence

All 66 candidate runs produced scoreable patches, with no pending or failed generation cells. All 66 exact final patches are public. The 30 cells across the two three-seed campaigns also publish their complete score samples and intervals in the site snapshot. Every objective-gate record is no_gate: the corpus does not yet have curated held-out tests suitable for a fair candidate-independent pass/fail claim. The current numbers are judge evidence, not test-pass rates.

The site links every score to a machine-readable record. For the original campaign, the manifest, complete merged results.json, exact candidate patches, and evidence directories are public in hive-bench. The two follow-ups’ three-sample distributions are included in the site’s data snapshot, and every later cell links directly to its candidate patch. Raw provider streams, build logs, target clones, and auth material remain unpublished.

Time, tokens, and cost

Wall time is recorded per task where recoverable. The original campaign has 32 timed cells out of 36; the first three-seed follow-up has 14 out of 18, and the production-panel follow-up has all 12. A displayed time mean uses only those recorded cells and always shows its sample count.
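A time mean that reports its own sample count could look like the sketch below. The function name is hypothetical and the durations are made up; `None` stands in for a cell whose wall time was not recoverable.

```python
def wall_time_summary(wall_clock_secs):
    """Return (mean_seconds, sample_count) over recorded cells only.

    Unrecorded cells are passed as None and are skipped rather than
    counted as zero. With no recorded cells the mean is None.
    """
    recorded = [s for s in wall_clock_secs if s is not None]
    if not recorded:
        return None, 0
    return sum(recorded) / len(recorded), len(recorded)


# Hypothetical durations: two recorded tasks and one unrecovered task.
mean_sec, n = wall_time_summary([600, None, 900])
```

Returning the count alongside the mean keeps a two-cell average from being read as a six-cell one.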

For the original campaign, generation tokens and API-equivalent costs are recomputed uniformly from per-event stage logs with HiveBench::TokenReport. Session-cumulative result and system events are excluded rather than counted again. Codex usage is normalized by subtracting cached_input_tokens from its inclusive input count, then pricing cached and uncached input separately. Event model ids provide the first attribution signal; stages provide the fallback for Codex events that do not carry a model id. Claude’s internal Haiku utility calls remain a separate priced model instead of being charged at the Opus rate.
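The Codex normalization and the attribution fallback described above can be illustrated in a few lines. This is not the HiveBench::TokenReport implementation: the function names, the event field names `model` and `stage`, and the per-million-token rates are all assumptions made for the example.

```python
def normalize_codex_input(input_tokens, cached_input_tokens):
    """Split Codex's inclusive input count into (fresh, cached) tokens."""
    if cached_input_tokens > input_tokens:
        raise ValueError("cached tokens cannot exceed the inclusive input count")
    return input_tokens - cached_input_tokens, cached_input_tokens


def input_cost(fresh, cached, fresh_rate_per_mtok, cached_rate_per_mtok):
    """Price fresh and cached input separately, in dollars."""
    return (fresh * fresh_rate_per_mtok + cached * cached_rate_per_mtok) / 1_000_000


def attribute_model(event, stage_models):
    """Prefer the event's own model id; fall back to the stage's model."""
    model = event.get("model")
    if model:
        return model
    return stage_models[event["stage"]]


# Hypothetical event: 1,000 inclusive input tokens, 800 of them cached.
fresh, cached = normalize_codex_input(1000, 800)
cost = input_cost(fresh, cached, fresh_rate_per_mtok=2.0, cached_rate_per_mtok=0.5)
```

Subtracting before pricing matters because charging the inclusive count at the uncached rate would bill the cached tokens twice.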

The first follow-up retains the per-event stage logs used by HiveBench::TokenReport. Five of the six Sol-plan/Terra-execute cells preserve complete Codex events, so the site publishes their normalized token splits and API-equivalent costs. For the remaining cell, the recovered fix-review artifact retains only aggregate inclusive-input usage; without the cached-input split, its comparable tokens and price are unknown.

Grok emits no usable token events through this runner, so the two Grok-execution rows publish explicitly labeled known-provider subtotals, never complete workflow totals. Sol-plan/Grok-execute/Sol-review includes retained Sol plan and review usage for all six tasks. Fable-plan/Grok-execute/Sol-review includes Fable planning for all six tasks plus retained Sol review events where present.

The production-panel campaign retains comparable Sol/Terra events for all 12 cells. Sol-plan/Terra-execute/Grok-review therefore publishes Sol plus Terra subtotals; the production-like Sol/Sol/Sol+Grok configuration publishes its Sol stage subtotal. Grok reviewer usage is absent from both, and no Grok usage is imputed.

All displayed wall times come from the corresponding serialized wall_clock_sec values. The per-event source logs used to reconstruct normalized token splits and costs are not public. The resulting token and cost values are available in the site’s data snapshot, but visitors cannot yet independently rerun that accounting from the public evidence bundle.

The normalized token total includes four non-overlapping buckets: fresh input, output, cache reads, and cache creation/writes. The site publishes the split for every measured candidate/task cell and for each candidate average. “Cache” is therefore not subtracted from the displayed total: the cache read and write figures are components of that total. This distinction matters because most of the observed token volume is cache reuse rather than fresh input.
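The four-bucket total can be made concrete with a small data structure. The class and field names below are hypothetical, and the token counts are invented to show the shape of the calculation rather than any published cell.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenSplit:
    """Four non-overlapping token buckets for one candidate/task cell."""
    fresh_input: int
    output: int
    cache_read: int
    cache_write: int

    @property
    def total(self):
        # Cache reads and writes are components of the total, not deductions.
        return self.fresh_input + self.output + self.cache_read + self.cache_write

    @property
    def cache_read_share(self):
        """Fraction of the total that is cache reuse."""
        return self.cache_read / self.total if self.total else 0.0


# Hypothetical cell dominated by cache reads.
cell = TokenSplit(fresh_input=1_000, output=500, cache_read=8_000, cache_write=500)
```

Here `cell.total` is 10,000 and `cell.cache_read_share` is 0.8, which is the kind of split where reading the total as fresh input would badly overstate new work.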

Per-model attribution makes the mixed Opus plan → Codex 5.5 xhigh candidate priceable: across all six tasks, Codex 5.5 contributes $67.3351, Opus 4.8 contributes $47.6319, and Haiku utility calls contribute $0.6159, for $115.5829 total or an average of $19.26 per task. The site also publishes every task-level cost, token-bucket split, and recorded wall time rather than only the mean.
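The totals quoted for the mixed candidate can be rechecked directly from the three per-model contributions. The snippet uses only the figures stated in the text; the variable names are illustrative, and `Decimal` is used so the cents are not disturbed by binary floating point.

```python
from decimal import Decimal

# Per-model contributions across all six tasks, as stated in the text.
contributions = {
    "Codex 5.5": Decimal("67.3351"),
    "Opus 4.8": Decimal("47.6319"),
    "Haiku utility calls": Decimal("0.6159"),
}

TASK_COUNT = 6

total = sum(contributions.values(), Decimal("0"))
per_task = (total / TASK_COUNT).quantize(Decimal("0.01"))

print(f"total: ${total}, average per task: ${per_task}")
```

The sum comes to 115.5829 and the six-task average rounds to 19.26, matching the published figures.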

Costs use the versioned 2026-06-usual price table and are descriptive API-equivalent estimates, not a billing claim. Judge usage is excluded. Grok’s runner emits no usable token events, so workflows that use Grok keep their complete-workflow token totals and cost fields unknown, not zero. The displayed known-provider token and cost subtotals are explicitly marked partial and are not used to claim a full-workflow price or cost-sort value. No missing value is imputed from another provider or model.
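One way to keep a missing provider from silently becoming zero is to carry the unknown through the sum. The sketch below assumes a hypothetical per-provider cost mapping in which `None` means no usable token events were emitted; the function name and dollar amounts are illustrative only.

```python
def workflow_cost(provider_costs):
    """Return (complete_total, known_subtotal, is_partial).

    complete_total is None whenever any provider's cost is unknown, so a
    partial subtotal can never be mistaken for a full-workflow price.
    """
    known = [c for c in provider_costs.values() if c is not None]
    subtotal = sum(known)
    is_partial = len(known) != len(provider_costs)
    return (None if is_partial else subtotal), subtotal, is_partial


# Hypothetical workflow with one provider that reports no usage.
mixed = workflow_cost({"Sol": 3.0, "Grok": None})
# Hypothetical workflow where every provider reports usage.
complete = workflow_cost({"Sol": 3.0, "Terra": 2.0})
```

A cost sort built on this shape would skip any row whose first element is `None` instead of ranking it as the cheapest.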

Known limitations

Small, single-project corpus. Six Ruby/CLI tasks from one repository do not support a universal “best coding model” claim.
Mixed sampling depth. The original 36 cells have one sample per primary judge; the 30 later cells have three. Stability intervals therefore exist only for the later rows, and close gaps in the original cohort may reverse.
No objective gates. Human-aligned judge scoring is the only current quality signal.
Judge-family overlap. Every follow-up candidate uses Sol for at least one production stage and is scored by the Sol judge; the Fable-planned candidate also shares a family with the Fable judge. Flags disclose this overlap but cannot remove the bias.
Corpus provenance can frame the task. All original task plans were Claude-authored, even though candidates re-ran the workflow themselves.
Recovered telemetry is incomplete. Some finished cells predate complete wall-time capture, and Grok exposes no usable token stream. The site labels every affected task cell and aggregate.
Token and cost source logs are not published. The normalized site snapshot is public, but the underlying provider streams needed to reproduce that accounting are intentionally excluded from the evidence bundle. Published per-cell wall times remain directly checkable.
Post-merge references. The reference PRs are human-reviewed outcomes. They are strong task evidence, but model training contamination cannot be ruled out for already-public PRs.

The scoped claim is therefore: what these configurations shipped through the full Hive workflow on this corpus, not which model is universally best.