BC-Bench: Step-by-Step Guide

Part 1 — Introduction and Core Concepts


What is BC-Bench?

BC-Bench is an open-source benchmarking framework created by Microsoft to evaluate coding agents on real-world Microsoft Dynamics 365 Business Central development tasks using the AL language.

It is inspired by SWE-Bench, the reference benchmark for evaluating AI agents on software engineering tasks, but adapted to the Business Central ecosystem with its specific requirements: AL app compilation, BC containers, codeunit tests, and the BCApps project structure.

What is it for?

BC-Bench lets you answer concrete questions such as: which agent resolves the most bugs, how much custom instructions and skills improve the resolution rate, and what each configuration costs in tokens and wall-clock time.

How does it work at a high level?

1. DATASET: 101 real Business Central bugs with gold patches and tests
       |
2. AGENT: A coding agent (Claude Code, Copilot CLI, mini-bc-agent) attempts to fix the bug
       |
3. EVALUATION: The test patch is applied, code is compiled, and tests run inside a BC container
       |
4. RESULT: resolved/failed + metrics (tokens, time, tools used)
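The four stages above can be sketched as a single driver function. This is an illustrative sketch only; the function and field names here (other than those shown in the dataset structure later in this guide) are hypothetical placeholders, not the harness's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Result:
    """Outcome of one evaluation: resolved/failed plus usage metrics."""
    instance_id: str
    resolved: bool
    metrics: dict = field(default_factory=dict)

def evaluate_instance(instance: dict, run_agent, run_tests) -> Result:
    # 2. AGENT: the agent produces a candidate patch plus usage metrics
    patch, metrics = run_agent(instance)
    # 3. EVALUATION: apply test_patch, compile, and run tests in the container
    passed = run_tests(instance, patch)
    # 4. RESULT: resolved/failed + metrics (tokens, time, tools used)
    return Result(instance["instance_id"], resolved=passed, metrics=metrics)

# Toy stand-ins for a real agent and a real BC-container test run:
def stub_agent(instance):
    return "--- a/...\n+++ b/...", {"tokens": 1200}

def stub_tests(instance, patch):
    return bool(patch)  # pretend any non-empty patch passes

result = evaluate_instance({"instance_id": "microsoft__BCApps-4822"},
                           stub_agent, stub_tests)
print(result.resolved)  # True
```

The real dataset (stage 1) supplies `instance`; the stubs stand in for stages 2 and 3.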

Each task in the dataset comes from a real bug reported in the BCApps or NAV repository. For every task, the dataset includes the problem statement as reported, the gold patch that fixed the bug, a test patch with the validating tests, and the FAIL_TO_PASS/PASS_TO_PASS test lists.


Repository Architecture

BC-Bench/
|-- src/bcbench/           # Evaluation harness (Python)
|   |-- agent/             # Agent implementations (mini, claude, copilot)
|   |   |-- shared/        # Shared configuration: prompts, skills, instructions
|   |   |   |-- config.yaml           # Main agent configuration
|   |   |   |-- instructions/         # Custom instructions per repository
|   |   |       |-- microsoft-BCApps/ # Instructions, skills, agents for BCApps
|   |   |       |-- microsoftInternal-NAV/  # Instructions for NAV
|   |   |-- mini/          # mini-bc-agent (minimal baseline)
|   |   |-- claude/        # Claude Code integration
|   |   |-- copilot/       # GitHub Copilot CLI integration
|   |-- commands/          # CLI commands (run, evaluate, dataset, result)
|   |-- evaluate/          # Evaluation pipelines (bug-fix, test-generation)
|   |-- results/           # Result classes, metrics, leaderboard
|   |-- operations/        # BC operations (build, test, git, setup)
|-- dataset/
|   |-- bcbench.jsonl      # Main dataset (101 entries)
|   |-- problemstatement/  # Problem statements with README.md and images
|-- scripts/               # PowerShell scripts for VM and evaluation
|-- notebooks/             # Jupyter notebooks for result analysis
|-- docs/                  # Documentation and leaderboard
|-- tests/                 # Harness unit tests

Key Concepts

Evaluation Categories

BC-Bench supports two categories:

| Category | What the agent does | How it is evaluated |
|---|---|---|
| bug-fix | Receives the bug description and must produce the patch that fixes it | The test_patch is applied, the code is compiled, and tests are run. If they pass: resolved |
| test-generation | Receives the bug (and optionally the fix) and must generate tests that reproduce it | Tests must FAIL against unfixed code and PASS with the fix applied |

Available Agents

| Agent | Description | Primary use |
|---|---|---|
| mini-bc-agent | Minimal loop based on mini-swe-agent; PowerShell only | Reference baseline |
| Claude Code | Anthropic’s agentic tool; supports MCP, custom instructions, and agents | Advanced evaluation |
| GitHub Copilot CLI | GitHub’s Copilot CLI; supports MCP, tools, and agent mode | Advanced evaluation |

Comparison Scenarios

A scenario is a combination of agent + configuration. Typical scenarios are:

| Scenario | What it includes |
|---|---|
| baseline | Agent with no custom instructions, no skills, no custom agents |
| with instructions | Agent + custom instructions (CLAUDE.md or copilot-instructions.md) |
| with instructions + skills | Agent + instructions + specialized skills (bugfix, debug, testing…) |
| with specialized agent | All of the above + a custom agent (al-developer-bench, al-conductor-bench…) |
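Because a scenario is just agent + configuration, it can be modeled as a small value object. This is a hypothetical model for illustration; the field names are not the harness's actual configuration schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    """One comparison scenario: an agent plus optional configuration layers."""
    agent: str
    instructions: bool = False   # custom instructions (CLAUDE.md, ...)
    skills: bool = False         # specialized skills (bugfix, debug, testing, ...)
    custom_agent: bool = False   # a specialized agent on top of everything else

    def label(self) -> str:
        extras = [name for name, enabled in [
            ("instructions", self.instructions),
            ("skills", self.skills),
            ("specialized agent", self.custom_agent),
        ] if enabled]
        return f"{self.agent}: " + (" + ".join(extras) if extras else "baseline")

print(Scenario("claude-code").label())              # claude-code: baseline
print(Scenario("claude-code", True, True).label())  # claude-code: instructions + skills
```

Freezing the dataclass makes scenarios hashable, so results can be grouped per scenario in a dict.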

Dataset Entry Structure

Each line in dataset/bcbench.jsonl is a JSON object with this structure:

{
  "instance_id": "microsoft__BCApps-4822",
  "repo": "microsoft/BCApps",
  "base_commit": "a1b2c3d4...",
  "created_at": "2025-01-15",
  "environment_setup_version": "26.0",
  "project_paths": [
    "App\\Apps\\W1\\Shopify\\app",
    "App\\Apps\\W1\\Shopify\\test"
  ],
  "FAIL_TO_PASS": [
    {
      "codeunitID": 139648,
      "functionName": ["UnitTestSuggestShopifyPaymentsFailedTransaction"]
    }
  ],
  "PASS_TO_PASS": [],
  "patch": "--- a/...\n+++ b/...\n@@ ...",
  "test_patch": "--- a/...\n+++ b/...\n@@ ...",
  "metadata": {
    "area": "shopify",
    "image_count": 3
  }
}
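Since the file is JSON Lines, each line parses independently with the standard library. The snippet below runs against inline sample entries so it is standalone; the second instance id is invented for illustration, but the field names match the structure above:

```python
import json

# Two sample entries in bcbench.jsonl's line-per-entry format
# (trimmed to a few fields; the second instance_id is made up).
sample_jsonl = """\
{"instance_id": "microsoft__BCApps-4822", "repo": "microsoft/BCApps", "metadata": {"area": "shopify", "image_count": 3}}
{"instance_id": "microsoft__BCApps-5001", "repo": "microsoft/BCApps", "metadata": {"area": "sales", "image_count": 0}}
"""

# JSON Lines: one json.loads per non-empty line
entries = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]

# Group instance ids by functional area, as recorded in metadata.area
by_area: dict[str, list[str]] = {}
for entry in entries:
    by_area.setdefault(entry["metadata"]["area"], []).append(entry["instance_id"])

print(by_area["shopify"])  # ['microsoft__BCApps-4822']
```

Replacing `sample_jsonl.splitlines()` with iteration over an open `dataset/bcbench.jsonl` gives the same result on the real file.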

Key fields:

- instance_id: unique task identifier (repository + issue/PR number)
- base_commit: commit the repository is reset to before the agent runs
- project_paths: AL app and test projects involved in the fix
- FAIL_TO_PASS: tests that fail before the fix and must pass after it
- PASS_TO_PASS: tests that already pass and must keep passing
- patch: the gold patch that fixes the bug
- test_patch: the patch that adds the validating tests

Problem Statements

Each entry has a directory at dataset/problemstatement/{instance_id}/ with a README.md containing the bug report and any images (screenshots) it references.

Typical problem statement example:

# Title: Shopify - Export customer as location - Sell-to and Bill-to are missing
## Repro Steps:
1. Export two companies to Shopify
2. One normal, another with a different bill-to defined
...
**ACTUAL RESULT:** Sell-to and Bill-to fields are empty
**EXPECTED RESULT:** Fields should be populated correctly
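Since these statements are plain markdown, the bolded result lines can be pulled out with a small regex. A minimal sketch, assuming the `**ACTUAL RESULT:**` / `**EXPECTED RESULT:**` layout shown above:

```python
import re

# Problem statement following the layout from the example above
statement = """\
# Title: Shopify - Export customer as location - Sell-to and Bill-to are missing
## Repro Steps:
1. Export two companies to Shopify
**ACTUAL RESULT:** Sell-to and Bill-to fields are empty
**EXPECTED RESULT:** Fields should be populated correctly
"""

def extract(label: str, text: str) -> str:
    """Return the text after a bolded '**LABEL:**' marker, or '' if absent."""
    match = re.search(rf"\*\*{label}:\*\*\s*(.+)", text)
    return match.group(1).strip() if match else ""

print(extract("ACTUAL RESULT", statement))
print(extract("EXPECTED RESULT", statement))
```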

Full Evaluation Flow

This is the flow for each individual evaluation (bug-fix category):

1. SETUP
   |-- Clean repository (git clean, checkout base_commit)
   |-- Copy problem statement into the repo
   |-- Compile base projects (timeout: 30 min BaseApp, 5 min apps)
   |-- Copy instructions/skills/agents (if enabled)

2. AGENT EXECUTION
   |-- Build prompt with task and context
   |-- Configure MCP servers (if --al-mcp is enabled)
   |-- Execute agent (timeout: 60 min)
   |-- Capture metrics: tokens, time, tools, turns

3. EVALUATION
   |-- Capture agent-generated patch (git diff)
   |-- Apply test_patch (adds validation tests)
   |-- Compile with tests (timeout: 5 min)
   |-- Run tests (timeout: 3 min per test)
   |-- Determine result: resolved / build-failure / test-failure

4. RESULT
   |-- Save to {instance_id}.jsonl
   |-- Includes: resolved, build, patch, metrics, configuration
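The three-way outcome at the end of step 3 can be summarized as a pure decision function. A hypothetical sketch, not the harness's actual code:

```python
def classify(build_ok: bool, test_results: list[bool]) -> str:
    """Map a compile + test run to resolved / build-failure / test-failure."""
    if not build_ok:
        return "build-failure"    # compilation with the test_patch applied failed
    if not test_results or not all(test_results):
        return "test-failure"     # no tests ran, or at least one still fails
    return "resolved"             # compiled and every test passed

print(classify(False, []))            # build-failure
print(classify(True, [True, False]))  # test-failure
print(classify(True, [True, True]))   # resolved
```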

Default Timeouts

| Operation | Timeout |
|---|---|
| BaseApp compilation | 30 minutes |
| App compilation | 5 minutes |
| Test execution | 3 minutes |
| Agent execution | 60 minutes |
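For scripting against these limits, the same defaults can be kept in one mapping, converted to seconds (a convenience sketch; the harness may store its timeouts differently):

```python
# Default BC-Bench timeouts from the table above, in seconds
TIMEOUTS_S = {
    "baseapp_compile": 30 * 60,  # BaseApp compilation: 30 minutes
    "app_compile": 5 * 60,       # App compilation: 5 minutes
    "test_run": 3 * 60,          # Test execution: 3 minutes per test
    "agent_run": 60 * 60,        # Agent execution: 60 minutes
}

print(TIMEOUTS_S["agent_run"])  # 3600
```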

Next: Part 2 — Installation, Setup and First Evaluation