BC-Bench is an open-source benchmarking framework created by Microsoft to evaluate coding agents on real-world Microsoft Dynamics 365 Business Central development tasks using the AL language.
It is inspired by SWE-Bench, the reference benchmark for evaluating AI agents on software engineering tasks, but adapted to the Business Central ecosystem with its specific requirements: AL app compilation, BC containers, codeunit tests, and the BCApps project structure.
BC-Bench lets you answer concrete questions such as: does a given agent configuration actually resolve real Business Central bugs, and at what cost in tokens and time? At a high level, each evaluation runs through four stages:

```
1. DATASET:     101 real Business Central bugs with gold patches and tests
                     |
2. AGENT:       A coding agent (Claude Code, Copilot CLI, mini-bc-agent) attempts to fix the bug
                     |
3. EVALUATION:  The test patch is applied, the code is compiled, and tests run inside a BC container
                     |
4. RESULT:      resolved/failed + metrics (tokens, time, tools used)
```
Each task in the dataset comes from a real bug reported in the BCApps or NAV repository. The BC-Bench repository is organized as follows:
```
BC-Bench/
|-- src/bcbench/                       # Evaluation harness (Python)
|   |-- agent/                         # Agent implementations (mini, claude, copilot)
|   |   |-- shared/                    # Shared configuration: prompts, skills, instructions
|   |   |   |-- config.yaml            # Main agent configuration
|   |   |   |-- instructions/          # Custom instructions per repository
|   |   |   |-- microsoft-BCApps/      # Instructions, skills, agents for BCApps
|   |   |   |-- microsoftInternal-NAV/ # Instructions for NAV
|   |   |-- mini/                      # mini-bc-agent (minimal baseline)
|   |   |-- claude/                    # Claude Code integration
|   |   |-- copilot/                   # GitHub Copilot CLI integration
|   |-- commands/                      # CLI commands (run, evaluate, dataset, result)
|   |-- evaluate/                      # Evaluation pipelines (bug-fix, test-generation)
|   |-- results/                       # Result classes, metrics, leaderboard
|   |-- operations/                    # BC operations (build, test, git, setup)
|-- dataset/
|   |-- bcbench.jsonl                  # Main dataset (101 entries)
|   |-- problemstatement/              # Problem statements with README.md and images
|-- scripts/                           # PowerShell scripts for VM and evaluation
|-- notebooks/                         # Jupyter notebooks for result analysis
|-- docs/                              # Documentation and leaderboard
|-- tests/                             # Harness unit tests
```
BC-Bench supports two categories:
| Category | What the agent does | How it is evaluated |
|---|---|---|
| bug-fix | Receives the bug description and must produce the patch that fixes it | The test_patch is applied, compiled, and tests are run. If they pass: resolved |
| test-generation | Receives the bug (and optionally the fix) and must generate tests that reproduce it | Tests must FAIL against unfixed code and PASS with the fix applied |
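The test-generation criterion above can be sketched as a simple check. This is an illustrative helper, not the harness's actual code (which lives under `src/bcbench/evaluate/`); the callables are hypothetical stand-ins for the real build-and-test steps:

```python
def evaluate_test_generation(run_tests, apply_fix):
    """Generated tests are valid only if they FAIL on the unfixed code
    and PASS once the gold patch is applied.

    run_tests: callable returning True when all generated tests pass.
    apply_fix: callable that applies the gold patch to the checkout.
    """
    fails_before = not run_tests()   # must reproduce the bug
    apply_fix()
    passes_after = run_tests()       # must be green with the fix
    return fails_before and passes_after
```

Both halves are required: tests that pass before the fix never reproduced the bug, and tests that still fail after it are simply broken.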
Three agents are currently integrated:

| Agent | Description | Primary Use |
|---|---|---|
| mini-bc-agent | Minimal loop based on mini-swe-agent. PowerShell only. | Reference baseline |
| Claude Code | Anthropic’s agentic tool. Supports MCP, custom instructions, agents. | Advanced evaluation |
| GitHub Copilot CLI | GitHub’s agentic CLI. Supports MCP, tools, agent mode. | Advanced evaluation |
A scenario is a combination of agent + configuration. Typical scenarios are:
| Scenario | What it includes |
|---|---|
| baseline | Agent with no custom instructions, no skills, no custom agents |
| with instructions | Agent + custom instructions (CLAUDE.md or copilot-instructions.md) |
| with instructions + skills | Agent + instructions + specialized skills (bugfix, debug, testing…) |
| with specialized agent | All of the above + a custom agent (al-developer-bench, al-conductor-bench…) |
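A scenario like "with instructions + skills" might be expressed in `config.yaml` roughly as follows. This is a hypothetical illustration of the structure only; the real schema is defined by the harness:

```yaml
# Illustrative scenario definition -- the actual config.yaml schema may differ.
scenario: with-instructions-skills
agent: claude              # mini | claude | copilot
instructions: true         # copy CLAUDE.md / copilot-instructions.md into the repo
skills: [bugfix, debug, testing]
custom_agent: null         # e.g. al-developer-bench for the richest scenario
```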
Each line in dataset/bcbench.jsonl is a JSON object with this structure:
```json
{
  "instance_id": "microsoft__BCApps-4822",
  "repo": "microsoft/BCApps",
  "base_commit": "a1b2c3d4...",
  "created_at": "2025-01-15",
  "environment_setup_version": "26.0",
  "project_paths": [
    "App\\Apps\\W1\\Shopify\\app",
    "App\\Apps\\W1\\Shopify\\test"
  ],
  "FAIL_TO_PASS": [
    {
      "codeunitID": 139648,
      "functionName": ["UnitTestSuggestShopifyPaymentsFailedTransaction"]
    }
  ],
  "PASS_TO_PASS": [],
  "patch": "--- a/...\n+++ b/...\n@@ ...",
  "test_patch": "--- a/...\n+++ b/...\n@@ ...",
  "metadata": {
    "area": "shopify",
    "image_count": 3
  }
}
```
Key fields:

- `instance_id`: unique identifier in the `{repo}__{issue-number}` format
- `patch`: the gold patch that fixes the bug
- `test_patch`: the validation tests applied during evaluation
- `FAIL_TO_PASS`: tests that fail before the fix and must pass after it

Each entry also has a directory at `dataset/problemstatement/{instance_id}/` containing:

- `README.md`: bug description, reproduction steps, expected vs. actual result

A typical problem statement looks like this:
```markdown
# Title: Shopify - Export customer as location - Sell-to and Bill-to are missing

## Repro Steps:
1. Export two companies to Shopify
2. One normal, another with a different bill-to defined
...

**ACTUAL RESULT:** Sell-to and Bill-to fields are empty
**EXPECTED RESULT:** Fields should be populated correctly
```
This is the flow for each individual evaluation (bug-fix category):
```
1. SETUP
   |-- Clean repository (git clean, checkout base_commit)
   |-- Copy problem statement into the repo
   |-- Compile base projects (timeout: 30 min BaseApp, 5 min apps)
   |-- Copy instructions/skills/agents (if enabled)

2. AGENT EXECUTION
   |-- Build prompt with task and context
   |-- Configure MCP servers (if --al-mcp is enabled)
   |-- Execute agent (timeout: 60 min)
   |-- Capture metrics: tokens, time, tools, turns

3. EVALUATION
   |-- Capture agent-generated patch (git diff)
   |-- Apply test_patch (adds validation tests)
   |-- Compile with tests (timeout: 5 min)
   |-- Run tests (timeout: 3 min per test)
   |-- Determine result: resolved / build-failure / test-failure

4. RESULT
   |-- Save to {instance_id}.jsonl
   |-- Includes: resolved, build, patch, metrics, configuration
```
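The decision at the end of step 3 can be sketched as follows (hypothetical function and parameter names; the real classification lives in `src/bcbench/results/`):

```python
def classify_result(build_ok, test_results):
    """Map an evaluation outcome to one of the three result labels.

    build_ok:     whether compilation with test_patch applied succeeded.
    test_results: one boolean per executed test (FAIL_TO_PASS + PASS_TO_PASS).
    """
    if not build_ok:
        return "build-failure"      # agent's patch does not even compile
    if not all(test_results):
        return "test-failure"       # compiles, but some test still fails
    return "resolved"               # all tests green
```

Build failures are checked first: a patch that does not compile never reaches the test stage.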
Timeouts at a glance:

| Operation | Timeout |
|---|---|
| BaseApp compilation | 30 minutes |
| App compilation | 5 minutes |
| Test execution | 3 minutes per test |
| Agent execution | 60 minutes |