BC-Bench: Step-by-Step Guide

Part 2 — Installation, Setup and First Evaluation


Prerequisites

Required Software

Tool              Minimum Version   Purpose
uv                Latest            Python package manager (replaces pip/venv)
Python            3.13+             Harness runtime
Git               2.40+             Repository management
Node.js           22+               Required for Claude Code
Docker + Hyper-V                    Business Central containers (Windows only)
PowerShell        7+                Evaluation scripts (Windows only)

Agent Tools (depending on which you use)

Agent               Installation                               Environment Variable
Claude Code         npm install -g @anthropic-ai/claude-code   ANTHROPIC_API_KEY
GitHub Copilot CLI  Via GitHub CLI                             GITHUB_TOKEN
mini-bc-agent       Included in BC-Bench                       AZURE_API_KEY, AZURE_API_BASE

PowerShell Modules (for evaluation with BC container)

Module                 Purpose
BcContainerHelper      Create and manage Business Central containers
AL Tool (dotnet tool)  AL MCP server (compilation, analysis)

Step-by-Step Installation

Step 1: Clone the repository

# Fork and clone (recommended for your own use)
gh repo fork microsoft/BC-Bench --clone
cd BC-Bench

# Or direct clone
git clone https://github.com/microsoft/BC-Bench.git
cd BC-Bench

Step 2: Install Python and dependencies

# Install Python 3.13 via uv
uv python install

# Install all dependencies (including dev and analysis)
uv sync --all-groups

# Install pre-commit hooks
uv run pre-commit install

Step 3: Verify the installation

# Show CLI help
uv run bcbench --help

You should see:

Usage: bcbench [OPTIONS] COMMAND [ARGS]...

BC-Bench: Benchmarking tool for Business Central (AL) ecosystem

Commands:
  collect    Collect dataset entries
  dataset    Query and analyze dataset
  evaluate   Evaluate agents on benchmark datasets
  result     Process and display evaluation results
  run        Run agents on single dataset entry

Step 4: Configure environment variables

Copy the sample file and fill in your credentials:

cp .env.sample .env

Key variables in .env:

# For Claude Code
ANTHROPIC_API_KEY=sk-ant-...

# For GitHub Copilot CLI
GITHUB_TOKEN=ghp_...

# For mini-bc-agent (Azure AI Foundry)
AZURE_API_KEY=...
AZURE_API_BASE=...
AZURE_API_VERSION=...

# For BC containers (full evaluation)
BC_CONTAINER_NAME=bcbench
BC_CONTAINER_USERNAME=admin
BC_CONTAINER_PASSWORD=YourPassword123!
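Before a run, it can help to confirm the variables for your chosen agent are actually set. A minimal sketch (the variable names come from the .env listing above; the helper itself is not part of BC-Bench):

```python
import os

# Required variables per agent, as listed in .env above
REQUIRED_VARS = {
    "claude": ["ANTHROPIC_API_KEY"],
    "copilot": ["GITHUB_TOKEN"],
    "mini": ["AZURE_API_KEY", "AZURE_API_BASE", "AZURE_API_VERSION"],
}

def missing_vars(agent: str) -> list[str]:
    """Return the required variables that are unset or empty for an agent."""
    return [v for v in REQUIRED_VARS[agent] if not os.environ.get(v)]

if __name__ == "__main__":
    for agent in REQUIRED_VARS:
        missing = missing_vars(agent)
        print(f"{agent}: {'ok' if not missing else 'missing: ' + ', '.join(missing)}")
```

Run it after sourcing your .env (e.g. with a dotenv loader) to catch a missing key before the agent fails mid-run.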

Exploring the Dataset

Before running an evaluation, familiarize yourself with the dataset.

List all entries

uv run bcbench dataset list

This shows all 101 available entries:

Found 101 entry(ies):
  - microsoftInternal__NAV-210528
  - microsoftInternal__NAV-224009
  - microsoft__BCApps-4822
  ...

View entry details

uv run bcbench dataset view microsoft__BCApps-4822

Shows detailed information: repo, commit, BC version, project paths, tests, and the problem statement.

To also see the gold patch:

uv run bcbench dataset view microsoft__BCApps-4822 --show-patch

Browse the dataset with the interactive TUI

uv run bcbench dataset review

Opens a terminal interface with a split view: entry information on the left, problem statement on the right. Use arrow keys to navigate between entries.

To also show resolution statistics (if you have previous results):

uv run bcbench dataset review --results-dir notebooks/result/bug-fix/my-directory/
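The entry IDs shown by dataset list follow an org__Repo-number pattern, which makes quick ad-hoc grouping easy. A sketch assuming only the output format shown above (the helper is not part of BC-Bench):

```python
from collections import Counter

# Sample IDs in the format printed by `bcbench dataset list`
entry_ids = [
    "microsoftInternal__NAV-210528",
    "microsoftInternal__NAV-224009",
    "microsoft__BCApps-4822",
]

def repo_of(entry_id: str) -> str:
    """Extract 'org/repo' from an ID shaped like org__Repo-number."""
    org, rest = entry_id.split("__", 1)
    repo = rest.rsplit("-", 1)[0]  # drop the trailing issue/PR number
    return f"{org}/{repo}"

counts = Counter(repo_of(e) for e in entry_ids)
print(counts)  # e.g. Counter({'microsoftInternal/NAV': 2, 'microsoft/BCApps': 1})
```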

Your First Evaluation

There are two execution modes:

Mode      Command           What it does                                   Requires BC container?
run       bcbench run       Runs the agent and generates the patch only    No
evaluate  bcbench evaluate  Runs the agent, then compiles and runs tests   Yes

Quick mode: generate patch only (no container)

This mode is ideal for getting started. You don’t need a BC container — just the agent and the BCApps repository cloned locally.

With Claude Code

uv run bcbench run claude microsoft__BCApps-4822 \
  --category bug-fix \
  --model claude-sonnet-4-6 \
  --container-name bcbench \
  --repo-path /path/to/BCApps

With GitHub Copilot CLI

uv run bcbench run copilot microsoft__BCApps-4822 \
  --category bug-fix \
  --model claude-sonnet-4.6 \
  --container-name bcbench \
  --repo-path /path/to/BCApps

With mini-bc-agent

uv run bcbench run mini microsoft__BCApps-4822 \
  --category bug-fix \
  --model gpt-5.1-codex-mini \
  --repo-path /path/to/BCApps

Note on model IDs: Claude Code uses hyphens (claude-sonnet-4-6), Copilot uses dots (claude-sonnet-4.6). They are different ID formats for the same model.
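If you script runs across both agents, the two ID formats can be converted mechanically. A small sketch (a hypothetical helper, not part of BC-Bench, assuming the version is always the trailing digit pair as in the examples above):

```python
import re

def to_copilot_id(claude_id: str) -> str:
    """Hyphenated version -> dotted version: claude-sonnet-4-6 -> claude-sonnet-4.6."""
    return re.sub(r"(\d)-(\d+)$", r"\1.\2", claude_id)

def to_claude_id(copilot_id: str) -> str:
    """Dotted version -> hyphenated version: claude-sonnet-4.6 -> claude-sonnet-4-6."""
    return re.sub(r"(\d)\.(\d+)$", r"\1-\2", copilot_id)

print(to_copilot_id("claude-sonnet-4-6"))  # claude-sonnet-4.6
print(to_claude_id("claude-opus-4.6"))     # claude-opus-4-6
```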

Full mode: evaluation with build + tests

This mode requires a running Business Central container (Windows with Docker/Hyper-V).

uv run bcbench evaluate claude microsoft__BCApps-4822 \
  --category bug-fix \
  --model claude-sonnet-4-6 \
  --container-name bcbench \
  --username admin \
  --password "YourPassword123!" \
  --repo-path C:\depot\BCApps \
  --run-id my_first_evaluation

Key parameters:

Parameter         Description
--category        bug-fix or test-generation
--model           LLM model to use
--container-name  BC container name
--run-id          Unique identifier for this run
--al-mcp          Enable the AL MCP server (compiler access)
--output-dir      Directory for saving results

Enabling the AL MCP server

The --al-mcp flag gives the agent access to the AL compiler via MCP (Model Context Protocol), letting it compile the project and see compiler errors during execution:

uv run bcbench evaluate claude microsoft__BCApps-4822 \
  --category bug-fix \
  --model claude-sonnet-4-6 \
  --container-name bcbench \
  --username admin \
  --password "YourPassword123!" \
  --al-mcp

Requirement: the AL Tool must be installed first:

dotnet tool install -g Microsoft.Dynamics.BusinessCentral.Development.Tools


Available Models

For Claude Code

Model              ID
Claude Sonnet 4.6  claude-sonnet-4-6
Claude Opus 4.6    claude-opus-4-6
Claude Haiku 4.5   claude-haiku-4-5

For GitHub Copilot CLI

Model              ID
Claude Sonnet 4.6  claude-sonnet-4.6
Claude Opus 4.6    claude-opus-4.6
Claude Haiku 4.5   claude-haiku-4.5
GPT 5.4            gpt-5.4
GPT 5.2            gpt-5.2
GPT 4.1            gpt-4.1

For mini-bc-agent

Model               ID
GPT 5.1 Codex Mini  gpt-5.1-codex-mini

What to Expect from the Result

After evaluation, the result is saved as a JSONL file ({instance_id}.jsonl) with this structure:

{
  "instance_id": "microsoft__BCApps-4822",
  "model": "claude-sonnet-4-6",
  "agent_name": "Claude Code",
  "category": "bug-fix",
  "resolved": true,
  "build": true,
  "timeout": false,
  "generated_patch": "--- a/...\n+++ b/...",
  "error_message": null,
  "metrics": {
    "execution_time": 145.3,
    "llm_duration": 120.5,
    "turn_count": 12,
    "prompt_tokens": 450000,
    "completion_tokens": 3500,
    "tool_usage": {
      "Read": 15,
      "Grep": 8,
      "Edit": 3,
      "Glob": 5
    }
  },
  "experiment": {
    "mcp_servers": ["altool", "mslearn"],
    "custom_instructions": true,
    "skills_enabled": true,
    "custom_agent": "al-developer-bench"
  }
}

Result fields:

Field            Meaning
instance_id      Dataset entry the run was executed against
resolved         Whether the generated patch made the reference tests pass
build            Whether the patched project compiled
timeout          Whether the run hit the time limit
generated_patch  Unified diff produced by the agent
error_message    Failure details, or null on success
metrics          Execution time, LLM duration, turn count, token counts, and per-tool usage
experiment       Configuration used for the run (MCP servers, custom instructions, skills, custom agent)


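The per-entry JSONL structure lends itself to quick aggregation across a run. A sketch that summarizes a directory of result files (field names come from the example above; the flat directory layout is an assumption):

```python
import json
from pathlib import Path

def summarize(results_dir: str) -> dict:
    """Aggregate resolved/build rates and token usage over *.jsonl result files."""
    records = []
    for path in Path(results_dir).glob("*.jsonl"):
        for line in path.read_text().splitlines():
            if line.strip():
                records.append(json.loads(line))
    total = len(records)
    if total == 0:
        return {"total": 0}
    return {
        "total": total,
        "resolved_rate": sum(r["resolved"] for r in records) / total,
        "build_rate": sum(r["build"] for r in records) / total,
        "prompt_tokens": sum(r["metrics"]["prompt_tokens"] for r in records),
    }
```

Pointing it at the output directory of a run gives a one-line resolution summary without opening each file.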
Next: Part 3 — Agent Configuration and Customization