After an evaluation, results are saved in the output directory with this structure:
evaluation_results/
|-- {run_id}/
|   |-- {instance_id}.jsonl      # Per-instance result
|   |-- evaluation_summary.json  # Aggregated summary
|   |-- bceval_results.jsonl     # Export in bceval format
|   |-- claude_debug.log         # Debug log (Claude Code)
|   |-- copilot-*.log            # Debug log (Copilot)
Each .jsonl file contains one JSON line per result with:
{
  "instance_id": "microsoft__BCApps-4822",
  "project": "BCApps",
  "model": "claude-sonnet-4-6",
  "agent_name": "Claude Code",
  "category": "bug-fix",
  "resolved": true,
  "build": true,
  "timeout": false,
  "generated_patch": "--- a/...\n+++ b/...",
  "error_message": null,
  "benchmark_version": "0.4.0",
  "metrics": {
    "execution_time": 145.3,
    "llm_duration": 120.5,
    "turn_count": 12,
    "prompt_tokens": 450000,
    "completion_tokens": 3500,
    "tool_usage": { "Read": 15, "Grep": 8, "Edit": 3 }
  },
  "experiment": {
    "mcp_servers": ["altool"],
    "custom_instructions": true,
    "skills_enabled": true,
    "custom_agent": "al-developer-bench"
  }
}
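Since each run directory holds plain JSONL files with the schema above, a few lines of standard-library Python are enough to load them and compute the pass rate yourself. This is a minimal sketch (the helper names `load_results` and `pass_rate` are illustrative, not part of bcbench):

```python
import json
from pathlib import Path

def load_results(run_dir: str) -> list[dict]:
    """Parse one JSON object per line from every .jsonl file in a run directory."""
    records = []
    for path in sorted(Path(run_dir).glob("*.jsonl")):
        for line in path.read_text().splitlines():
            if line.strip():
                records.append(json.loads(line))
    return records

def pass_rate(records: list[dict]) -> float:
    """pass_rate = resolved / total * 100"""
    return 100.0 * sum(r["resolved"] for r in records) / len(records)
```

Pointing `load_results` at `evaluation_results/{run_id}/` should reproduce the `resolved` count that the summary commands report.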
uv run bcbench result summarize {run_id}
Reads all .jsonl files from evaluation_results/{run_id}/, calculates aggregated metrics, and displays a console summary.
Also generates:
- evaluation_summary.json — Summary in JSON format
- bceval_results.jsonl — Export for external analysis
uv run bcbench result aggregate --input-dir notebooks/result/bug-fix/my-directory/
Similar to summarize, but does not require a specific run_id. Recursively scans the directory for all result .jsonl files.
# Review all unresolved results from a file
uv run bcbench result review results.jsonl --category bug-fix
# Review a specific instance across all runs in a directory
uv run bcbench result review notebooks/result/bug-fix/ \
--category bug-fix \
--instance-id microsoft__BCApps-4822
# Include resolved instances too
uv run bcbench result review results.jsonl \
--category bug-fix \
--include-resolved
The TUI shows a split view: expected patch (gold) vs generated patch (agent). Use j/k or arrow keys to navigate, 1-7 to classify failure categories.
uv run bcbench result update evaluation_results/{run_id}/evaluation_summary.json
Takes a run summary and adds it to the public leaderboard at docs/_data/{category}.json. Stores up to 5 runs per agent + model + configuration combination.
uv run bcbench result refresh
Recalculates leaderboard aggregated metrics without adding new data. Useful when aggregation logic changes.
BC-Bench calculates advanced statistical metrics:
pass_rate = resolved / total * 100
The basic metric: percentage of resolved instances.
BC-Bench uses BCa (bias-corrected and accelerated) bootstrapping to calculate 95% confidence intervals. This answers: “How confident can we be that the pass rate is X%?”
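A BCa interval of this kind can be reproduced with `scipy.stats.bootstrap` (scipy is not listed among the analysis dependencies, so treat it as an extra install; the outcome values below are made up):

```python
import numpy as np
from scipy.stats import bootstrap

# Hypothetical per-instance outcomes: 1 = resolved, 0 = not resolved.
outcomes = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1])

res = bootstrap(
    (outcomes,),             # data is passed as a tuple of samples
    np.mean,                 # the statistic: pass rate as a fraction
    confidence_level=0.95,
    method="BCa",            # bias-corrected and accelerated
    random_state=0,
)
low, high = res.confidence_interval
print(f"pass rate {outcomes.mean():.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```

The interval widens with fewer instances, which is why a 2-point pass-rate difference on a small dataset may not be meaningful.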
Probability of getting at least 1 correct solution in k attempts:
pass@k = 1 - C(n-c, k) / C(n, k)
Where n = total, c = correct. Useful when running multiple runs of the same scenario.
Probability that ALL k attempts are correct:
pass^k = estimated probability of consistent success
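Both quantities follow from counting combinations. The sketch below uses the standard hypergeometric estimators; the pass^k formula (`C(c,k)/C(n,k)`) is one common unbiased estimator and is an assumption about what BC-Bench computes, not confirmed from its source:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one of k sampled attempts is correct), given c of n attempts correct."""
    if n - c < k:
        return 1.0  # too few incorrect attempts to fill a size-k sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(n: int, c: int, k: int) -> float:
    """P(ALL k sampled attempts are correct) under the same sampling scheme."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

print(pass_at_k(10, 5, 1))   # 0.5 — with k=1 this is just c/n
print(pass_pow_k(10, 5, 2))  # C(5,2)/C(10,2)
```

Note the two metrics answer opposite questions: pass@k rewards occasional success across retries, while pass^k rewards consistency.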
BC-Bench includes Jupyter notebooks for visual result analysis:
notebooks/
|-- dataset.ipynb # Dataset statistics
|-- bug-fix/
| |-- overview.ipynb # Bug-fix overview
| |-- claude-vs-copilot.ipynb # Claude vs Copilot comparison
| |-- failure-analysis.ipynb # Failure analysis
| |-- altool-comparison.ipynb # AL MCP impact
| |-- claude-code-aldc-comparison.ipynb # With/without custom config comparison
| |-- aldc-comparison-full.ipynb # Full scenario comparison
|-- test-generation/
|   |-- overview.ipynb          # Test-generation overview
|   |-- altest-comparison.ipynb # Configuration comparison
# Install analysis dependencies
uv sync --group analysis
# Launch Jupyter
uv run jupyter lab
Analysis dependencies include: pandas, plotly, ipykernel, nbformat.
Notebooks read aggregated data from:
docs/_data/bug-fix.json
docs/_data/test-generation.json
Leaderboard structure:
{
  "runs": [
    {
      "total": 101,
      "resolved": 57,
      "failed": 44,
      "build": 100,
      "percentage": 56.4,
      "date": "2026-01-29",
      "model": "claude-opus-4-5",
      "agent_name": "Claude Code",
      "category": "bug-fix",
      "average_duration": 197.3,
      "average_prompt_tokens": 685413.2,
      "average_completion_tokens": 4821.8,
      "instance_results": {
        "microsoftInternal__NAV-204450": false,
        "microsoftInternal__NAV-179733": true
      }
    }
  ],
  "aggregate": [...]
}
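Because the `runs` array is a flat list of objects, it loads directly into a pandas DataFrame, which is presumably how the notebooks slice it. A minimal sketch with an inline sample mirroring the structure above (the values are illustrative, not real leaderboard data):

```python
import json
import pandas as pd

# Inline stand-in for docs/_data/bug-fix.json.
leaderboard = json.loads("""
{
  "runs": [
    {"model": "claude-opus-4-5", "agent_name": "Claude Code",
     "category": "bug-fix", "total": 101, "resolved": 57, "percentage": 56.4},
    {"model": "claude-sonnet-4-6", "agent_name": "Claude Code",
     "category": "bug-fix", "total": 101, "resolved": 49, "percentage": 48.5}
  ]
}
""")

df = pd.DataFrame(leaderboard["runs"])
# Rank runs by pass rate, as a comparison notebook might.
print(df.sort_values("percentage", ascending=False)[["model", "percentage"]])
```

Swapping the inline string for `json.loads(Path("docs/_data/bug-fix.json").read_text())` gives the real data.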
# On the VM
.\scripts\Run-FullComparison.ps1 `
-InstanceIds @("microsoft__BCApps-4822") `
-LlmFamily sonnet `
-OnlyMissing
# On your development machine
uv run bcbench result aggregate --input-dir notebooks/result/bug-fix/claude-baseline-sonnet-4-6/
uv run bcbench result aggregate --input-dir notebooks/result/bug-fix/claude-aldc-developer-sonnet-4-6/
uv run bcbench result update evaluation_results/my_run/evaluation_summary.json
# Review instances that failed
uv run bcbench result review notebooks/result/bug-fix/claude-baseline-sonnet-4-6/ \
--category bug-fix \
--instance-id microsoft__BCApps-4822
uv run jupyter lab notebooks/bug-fix/overview.ipynb
BC-Bench uses semantic versioning. The current version is defined in pyproject.toml:
[project]
version = "0.4.0"
| Bump | Change type | Examples |
|---|---|---|
| Major (X.0.0) | Dataset or methodology changes | Adding/removing entries, changing pass criteria |
| Minor (0.X.0) | Tooling changes that affect results | Updating Copilot version, changing agent prompts |
| Patch (0.0.X) | Fixes and documentation | Fixing a parsing bug, updating docs |
Results from different versions CANNOT be aggregated together. The bcbench result update command validates that all runs have the same benchmark_version. This ensures the leaderboard always compares apples to apples.
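The version gate amounts to checking that every run being merged carries the same `benchmark_version`. A sketch of that check (the real `bcbench result update` implementation may differ):

```python
def validate_versions(runs: list[dict]) -> str:
    """Return the shared benchmark_version, or raise if runs mix versions."""
    versions = {r["benchmark_version"] for r in runs}
    if len(versions) != 1:
        raise ValueError(f"cannot aggregate mixed benchmark versions: {sorted(versions)}")
    return versions.pop()
```

A run evaluated under 0.3.x therefore has to be re-run, not re-labeled, before it can join a 0.4.0 leaderboard.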
When documenting evaluation results, make sure to include:
- The benchmark version (from pyproject.toml)
- Trajectories (bcbench run mini-inspector for trajectories)
- environment_setup_version
- base_commit
- Any pause between scenarios (e.g. -PauseBetweenScenarios 180, 3 minutes between scenarios)
If the benchmark container misbehaves, use docker ps to see if it's running, or docker rm -f bcbench and re-run setup.