Full evaluations (build and tests) require a Windows Server machine with Docker/Hyper-V to run Business Central containers. BC-Bench includes scripts that automate the VM setup.
# Run as administrator on the VM
.\scripts\Setup-VM-Phase1.ps1
What it does:
- `C:\bcbench`, `C:\ProgramData\BcContainerHelper`

.\scripts\Setup-VM-Phase2.ps1 `
-AnthropicApiKey "sk-ant-..." `
-GitHubToken "ghp_..." `
-ContainerPassword "BcBench2026!"
What it installs (14 steps with verification):
- Claude Code (`npm install -g @anthropic-ai/claude-code`)
- `C:\bcbench`
- `uv sync --all-groups` (Python dependencies)

Once the VM is prepared, create the BC container and clone the evaluation repository.
.\scripts\Setup-ContainerAndRepository.ps1 `
-InstanceId "microsoft__BCApps-4822" `
-ContainerName "bcbench" `
-Username "admin"
What it does:
- Clones the evaluation repository (`$env:GITHUB_WORKSPACE\testbed`)
- Checks out the instance's `base_commit`
- Creates the BC container (`New-BCContainerSync`)

Note: The container is reused between evaluations that share the same environment_setup_version. It is only recreated when the BC version changes.
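The reuse rule in the note can be sketched as a small comparison. This is only an illustration: `Test-ContainerReusable` and its parameters are hypothetical names, the version strings are made up, and the actual script may track the existing container's version differently.

```powershell
# Hypothetical helper (not part of BC-Bench): decide whether an existing
# container can be reused for the next evaluation.
function Test-ContainerReusable {
    param(
        # environment_setup_version the current container was built for ('' if none exists)
        [string]$ExistingSetupVersion,
        # environment_setup_version required by the next dataset entry
        [Parameter(Mandatory)][string]$RequiredSetupVersion
    )
    # No container yet: it has to be created.
    if ([string]::IsNullOrWhiteSpace($ExistingSetupVersion)) { return $false }
    # Same version: reuse. Different version: recreate.
    return $ExistingSetupVersion -eq $RequiredSetupVersion
}

# Illustrative version strings only.
Test-ContainerReusable -ExistingSetupVersion '26.0' -RequiredSetupVersion '26.0'   # True
Test-ContainerReusable -ExistingSetupVersion '26.0' -RequiredSetupVersion '27.0'   # False
```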
This is the main script for launching evaluations. It supports multiple operation modes.
.\scripts\Setup-ALDCEvaluation.ps1 `
-InstanceId "microsoft__BCApps-4822" `
-Agent claude `
-Model claude-sonnet-4-6 `
-Category bug-fix
# Baseline only (no custom configuration)
.\scripts\Setup-ALDCEvaluation.ps1 `
-InstanceId "microsoft__BCApps-4822" `
-Scenario baseline
# Developer agent only
.\scripts\Setup-ALDCEvaluation.ps1 `
-InstanceId "microsoft__BCApps-4822" `
-Scenario aldc-developer
# Conductor agent (TDD) only
.\scripts\Setup-ALDCEvaluation.ps1 `
-InstanceId "microsoft__BCApps-4822" `
-Scenario aldc-conductor
# Bugfix firstline agent only
.\scripts\Setup-ALDCEvaluation.ps1 `
-InstanceId "microsoft__BCApps-4822" `
-Scenario aldc-bugfix
.\scripts\Setup-ALDCEvaluation.ps1 `
-InstanceId "microsoft__BCApps-4822" `
-CompareBaseline
Runs two rounds: first baseline (no custom config), then with the active configuration in config.yaml.
.\scripts\Setup-ALDCEvaluation.ps1 `
-InstanceId "microsoft__BCApps-4822" `
-CompareAll `
-PauseBetweenScenarios 180
Runs three rounds: baseline, developer, conductor. The -PauseBetweenScenarios parameter (in seconds) adds a wait between runs to avoid API overload errors.
| Parameter | Values | Default |
|---|---|---|
| -InstanceId | Dataset ID | (all if omitted) |
| -Agent | claude, copilot | claude |
| -Model | See model list | claude-sonnet-4-6 |
| -Category | bug-fix, test-generation | bug-fix |
| -Scenario | baseline, aldc-developer, aldc-conductor, aldc-bugfix | (none) |
| -CompareBaseline | Switch | $false |
| -CompareAll | Switch | $false |
| -AlMcp | Switch | $true |
| -SkipContainerSetup | Switch | $false |
| -SkipRepoClone | Switch | $false |
| -TestRun | Switch (only 2 entries) | $false |
| -PauseBetweenScenarios | Seconds | 0 |
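When several of these parameters are passed together, PowerShell splatting is one alternative to backtick line continuation. The sketch below is an assumption about how you might call the script, not something the scripts require; the parameter names and values come from the table, and the relative script path assumes the current directory is the BC-Bench checkout.

```powershell
# Collect the parameters in a hashtable; keys match the parameter names
# from the table (without the leading dash).
$evalParams = @{
    InstanceId            = 'microsoft__BCApps-4822'
    Agent                 = 'claude'
    Model                 = 'claude-sonnet-4-6'
    Category              = 'bug-fix'
    Scenario              = 'baseline'
    PauseBetweenScenarios = 0
}

# Assumed location of the script relative to the current directory.
$script = '.\scripts\Setup-ALDCEvaluation.ps1'

if (Test-Path $script) {
    # @evalParams expands the hashtable into named parameters.
    & $script @evalParams
} else {
    Write-Warning "Script not found: $script"
}
```

A hashtable is also easier to modify between runs, for example by changing only `$evalParams.Scenario` before the next call.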
This script runs all scenarios for both agents (Claude + Copilot), collects results, generates a report, and pushes it to the repository.
The script defines 8 scenarios (4 per agent):
| # | Agent | Scenario | Output Directory |
|---|---|---|---|
| 1 | Claude Code | baseline | eval_claude_baseline_{model} |
| 2 | Claude Code | developer | eval_claude_aldc_developer_{model} |
| 3 | Claude Code | conductor | eval_claude_aldc_conductor_{model} |
| 4 | Claude Code | bugfix | eval_claude_aldc_bugfix_{model} |
| 5 | Copilot | baseline | eval_copilot_baseline_{model} |
| 6 | Copilot | developer | eval_copilot_aldc_developer_{model} |
| 7 | Copilot | conductor | eval_copilot_aldc_conductor_{model} |
| 8 | Copilot | bugfix | eval_copilot_aldc_bugfix_{model} |
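The directory names in the table follow a regular pattern, which can be sketched as below. `Get-EvalOutputDirName` is a hypothetical helper, not a function from the scripts, and it assumes that `{model}` is substituted with the model ID as-is; the actual script may normalize the model name differently.

```powershell
# Hypothetical helper: build an output directory name following the
# eval_{agent}_{scenario}_{model} pattern from the table.
function Get-EvalOutputDirName {
    param(
        [Parameter(Mandatory)][ValidateSet('claude', 'copilot')][string]$Agent,
        [Parameter(Mandatory)]
        [ValidateSet('baseline', 'aldc-developer', 'aldc-conductor', 'aldc-bugfix')]
        [string]$Scenario,
        # Assumption: the model ID is inserted unchanged.
        [Parameter(Mandatory)][string]$Model
    )
    # -Scenario values use hyphens; the directory names use underscores.
    $scenarioPart = $Scenario -replace '-', '_'
    return "eval_${Agent}_${scenarioPart}_${Model}"
}

Get-EvalOutputDirName -Agent claude -Scenario aldc-developer -Model 'claude-sonnet-4-6'
# eval_claude_aldc_developer_claude-sonnet-4-6
```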
.\scripts\Run-FullComparison.ps1 `
-InstanceIds "microsoft__BCApps-4822"
.\scripts\Run-FullComparison.ps1 `
-InstanceIds @("microsoft__BCApps-4822", "microsoft__BCApps-4699", "microsoft__BCApps-4766")
.\scripts\Run-FullComparison.ps1 `
-LlmFamily opus `
-InstanceIds "microsoft__BCApps-4822"
The -LlmFamily parameter automatically derives model IDs:
- Claude Code: claude-{family}-4-6 (e.g., claude-opus-4-6)
- Copilot: claude-{family}-4.6 (e.g., claude-opus-4.6)

# Update scripts first
git -C C:\bcbench pull origin main
# Run full comparison with auto-shutdown
.\scripts\Run-FullComparison.ps1 `
-InstanceIds @("microsoft__BCApps-4822") `
-LlmFamily sonnet `
-OnlyMissing `
-EmailTo "your-email@gmail.com" `
-AutoShutdown
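For reference, the model ID derivation that -LlmFamily performs (used with `sonnet` in the command above) can be sketched as follows. `Get-ModelIdsForFamily` is a hypothetical name, and mapping the hyphenated form to Claude Code and the dotted form to Copilot is an assumption based on the model IDs used elsewhere in this guide.

```powershell
# Hypothetical helper: derive both model ID spellings from a family name.
function Get-ModelIdsForFamily {
    param([Parameter(Mandatory)][string]$LlmFamily)
    [pscustomobject]@{
        # Assumed to be the form passed to Claude Code, e.g. claude-opus-4-6
        ClaudeModel  = "claude-${LlmFamily}-4-6"
        # Assumed to be the form passed to Copilot, e.g. claude-opus-4.6
        CopilotModel = "claude-${LlmFamily}-4.6"
    }
}

Get-ModelIdsForFamily -LlmFamily sonnet
```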
Key parameters:
- `-OnlyMissing`: Skip scenarios that already have results (to resume failed runs)
- `-EmailTo`: Send summary email on completion (requires `$env:GMAIL_APP_PASSWORD`)
- `-AutoShutdown`: Shut down the VM after completion (saves cloud costs)
- `-SkipClaude` / `-SkipCopilot`: Run only one agent
- `-PauseBetweenScenarios`: Seconds to wait between scenarios (default: 30)

# Step 1: Ensure the container exists
.\scripts\Setup-ContainerAndRepository.ps1 `
-InstanceId "microsoft__BCApps-4822" `
-RepoPath "C:\bcbench\testbed"
# Step 2: Run baseline
.\scripts\Setup-ALDCEvaluation.ps1 `
-InstanceId "microsoft__BCApps-4822" `
-Scenario baseline `
-Agent claude `
-Model claude-sonnet-4-6 `
-RepoPath "C:\bcbench\testbed" `
-OutputDir "C:\bcbench\eval_baseline"
# Step 3: Run with custom configuration (reuse container)
.\scripts\Setup-ALDCEvaluation.ps1 `
-InstanceId "microsoft__BCApps-4822" `
-Scenario aldc-developer `
-Agent claude `
-Model claude-sonnet-4-6 `
-SkipContainerSetup `
-SkipRepoClone `
-RepoPath "C:\bcbench\testbed" `
-OutputDir "C:\bcbench\eval_developer"
.\scripts\Setup-ALDCEvaluation.ps1 `
-InstanceId "microsoft__BCApps-4822" `
-Category test-generation `
-Scenario aldc-conductor `
-Agent claude `
-Model claude-opus-4-6
.\scripts\Run-FullComparison.ps1 `
-InstanceIds @(
"microsoft__BCApps-4822",
"microsoft__BCApps-4699",
"microsoft__BCApps-4766"
) `
-SkipClaude `
-LlmFamily sonnet `
-Category bug-fix `
-OnlyMissing
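-OnlyMissing, used in the command above, skips scenarios that already have results. A check of that kind might look like the sketch below. The rule used here (an output directory that exists and contains at least one file counts as done) is an assumption, `Get-MissingScenarioDirs` is a hypothetical helper, and the directory names are placeholders; the real script may decide differently.

```powershell
# Hypothetical helper: return the output directories that still lack results.
function Get-MissingScenarioDirs {
    param(
        [Parameter(Mandatory)][string]$Root,        # folder holding the eval_* directories
        [Parameter(Mandatory)][string[]]$DirNames   # expected output directory names
    )
    foreach ($name in $DirNames) {
        $path = Join-Path $Root $name
        $hasResults = $false
        if (Test-Path -Path $path -PathType Container) {
            # Assumed rule: any file inside the directory counts as a result.
            $hasResults = [bool](Get-ChildItem -Path $path -File -Recurse |
                Select-Object -First 1)
        }
        if (-not $hasResults) { $name }
    }
}

# Placeholder names; lists the directories that would still need a run.
Get-MissingScenarioDirs -Root 'C:\bcbench' -DirNames 'eval_copilot_baseline_example', 'eval_copilot_aldc_developer_example'
```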
Running the scenarios one at a time with manual pauses is useful when the API returns overload errors (overloaded_error):
# Scenario 1: Baseline
.\scripts\Setup-ALDCEvaluation.ps1 `
-InstanceId "microsoft__BCApps-4822" `
-Scenario baseline `
-RepoPath "C:\bcbench\testbed"
# Wait 5 minutes
Start-Sleep -Seconds 300
# Scenario 2: Developer
.\scripts\Setup-ALDCEvaluation.ps1 `
-InstanceId "microsoft__BCApps-4822" `
-Scenario aldc-developer `
-SkipContainerSetup -SkipRepoClone `
-RepoPath "C:\bcbench\testbed"
# Wait 5 minutes
Start-Sleep -Seconds 300
# Scenario 3: Conductor
.\scripts\Setup-ALDCEvaluation.ps1 `
-InstanceId "microsoft__BCApps-4822" `
-Scenario aldc-conductor `
-SkipContainerSetup -SkipRepoClone `
-RepoPath "C:\bcbench\testbed"
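The manual sequence above can also be written as a loop. This is a sketch under assumptions rather than part of BC-Bench: `Get-ScenarioRunPlan` is a hypothetical helper, the relative script path assumes the current directory is the BC-Bench checkout, and the 300-second pause simply mirrors the waits shown above.

```powershell
# Hypothetical helper: build one parameter set per scenario. Every run after
# the first reuses the container and the cloned repository.
function Get-ScenarioRunPlan {
    param(
        [Parameter(Mandatory)][string]$InstanceId,
        [Parameter(Mandatory)][string[]]$Scenarios,
        [Parameter(Mandatory)][string]$RepoPath
    )
    for ($i = 0; $i -lt $Scenarios.Count; $i++) {
        $p = @{ InstanceId = $InstanceId; Scenario = $Scenarios[$i]; RepoPath = $RepoPath }
        if ($i -gt 0) {
            $p.SkipContainerSetup = $true
            $p.SkipRepoClone      = $true
        }
        $p
    }
}

$plan = @(Get-ScenarioRunPlan -InstanceId 'microsoft__BCApps-4822' `
    -Scenarios 'baseline', 'aldc-developer', 'aldc-conductor' `
    -RepoPath 'C:\bcbench\testbed')

$script = '.\scripts\Setup-ALDCEvaluation.ps1'   # assumed relative path
$pauseSeconds = 300

if (Test-Path $script) {
    for ($i = 0; $i -lt $plan.Count; $i++) {
        $p = $plan[$i]
        & $script @p
        # Pause between runs, but not after the last one.
        if ($i -lt $plan.Count - 1) { Start-Sleep -Seconds $pauseSeconds }
    }
} else {
    Write-Warning "Script not found: $script"
}
```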