LegacyCodeBench: Execution-Based Evaluation of AI Comprehension for Legacy COBOL Systems
January 2026
We introduce LegacyCodeBench, a benchmark for evaluating whether AI systems can accurately understand and document legacy COBOL code. Unlike existing benchmarks that test code generation (HumanEval, SWE-bench), LegacyCodeBench tests code comprehension, a critical capability for the $2 trillion legacy modernization industry.
Our key methodological innovation is claim-based behavioral verification: instead of relying on LLM-as-judge scoring or LLM code regeneration, we extract behavioral claims from AI-generated documentation and verify them by executing the original program. This provides objective verification while keeping the verification step fully deterministic.
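To make this concrete, a behavioral claim can be thought of as a structured, runtime-checkable assertion extracted from a documentation sentence. The schema below is a hypothetical illustration; the field names and the example program are ours, not the benchmark's:

```python
# Illustrative only: a hypothetical schema for an extracted behavioral claim.
# The actual LegacyCodeBench claim format is not specified in this summary.
from dataclasses import dataclass

@dataclass
class BehavioralClaim:
    source_sentence: str   # sentence in the AI-generated documentation
    variable: str          # COBOL data item the claim is about
    condition: str         # input condition under which the claim applies
    expected: str          # observable behavior to check at runtime

claim = BehavioralClaim(
    source_sentence="When the account balance is negative, a 35.00 overdraft fee is applied.",
    variable="WS-OVERDRAFT-FEE",
    condition="WS-BALANCE < 0",
    expected="WS-OVERDRAFT-FEE == 35.00",
)
```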
LegacyCodeBench comprises 200 real-world COBOL programs across 4 complexity tiers. Evaluating 5 state-of-the-art models, we find specialized COBOL models (Legacy Insights, AWS Transform) achieve 88-92% on the benchmark, while general-purpose models (Claude Sonnet 4, GPT-4o) achieve 86-90%. All models maintain strong performance across complexity tiers, degrading by 0-9 percentage points from the basic tier to the enterprise tier.
| Model | Provider | LCB Score | SC | DQ | BF |
|---|---|---|---|---|---|
| Legacy Insights | Hexaview | 92% | 94% | 96% | 90% |
| Claude Sonnet 4 | Anthropic | 90% | 96% | 78% | 91% |
| AWS Transform | AWS | 88% | 98% | 68% | 91% |
| IBM Granite 13B | IBM | 87% | 93% | 72% | 90% |
| GPT-4o | OpenAI | 86% | 92% | 71% | 89% |
SC = Structural Completeness (30% weight) | DQ = Documentation Quality (20% weight) | BF = Behavioral Fidelity (50% weight)
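The LCB Score column is consistent with a weighted average of the three sub-metrics under these weights. A minimal sketch that reproduces the leaderboard to within rounding (the exact aggregation and rounding rule are assumptions on our part):

```python
# Reproduce the LCB Score column as a weighted average of SC/DQ/BF.
# Weights from the legend above; rounding to whole percent is assumed.
WEIGHTS = {"SC": 0.30, "DQ": 0.20, "BF": 0.50}

leaderboard = {
    "Legacy Insights": {"SC": 94, "DQ": 96, "BF": 90},
    "Claude Sonnet 4": {"SC": 96, "DQ": 78, "BF": 91},
    "AWS Transform":   {"SC": 98, "DQ": 68, "BF": 91},
    "IBM Granite 13B": {"SC": 93, "DQ": 72, "BF": 90},
    "GPT-4o":          {"SC": 92, "DQ": 71, "BF": 89},
}

for model, m in leaderboard.items():
    lcb = sum(WEIGHTS[k] * m[k] for k in WEIGHTS)
    print(f"{model}: {lcb:.1f}%")  # e.g. Claude Sonnet 4: 89.9% -> 90%
```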
| Model | T1 (Basic) | T4 (Enterprise) | Degradation (pp) |
|---|---|---|---|
| Legacy Insights | 96% | 90% | 6 |
| Claude Sonnet 4 | 92% | 92% | 0 |
| AWS Transform | 88% | 87% | 1 |
| IBM Granite 13B | 89% | 84% | 5 |
| GPT-4o | 91% | 82% | 9 |
Key Finding: All models maintain strong performance across complexity tiers. Specialized COBOL models demonstrate robustness to enterprise complexity.
Structural Completeness (30% weight): all required documentation sections are present | Documentation Quality (20% weight): accuracy of the documentation's content | Behavioral Fidelity (50% weight): execution-based verification of behavioral claims
Behavioral claims are extracted from the AI-generated documentation and verified by executing the original COBOL program. The verification step is completely deterministic, with no LLM-as-judge.
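A minimal sketch of the execution step, assuming GnuCOBOL's `cobc` compiler is available and that a claim has been reduced to an input plus an expected output substring. The file names and claim representation are illustrative, not the benchmark's actual harness:

```python
# Minimal sketch of execution-based claim verification, assuming GnuCOBOL.
import subprocess

def verify_claim(cobol_src: str, stdin_data: str, expected_output: str) -> bool:
    # Compile the original program to a native executable (deterministic step).
    subprocess.run(["cobc", "-x", "-o", "prog", cobol_src], check=True)
    # Execute with the claim's input and capture observable behavior.
    result = subprocess.run(
        ["./prog"], input=stdin_data, capture_output=True, text=True, timeout=30
    )
    # The claim passes iff the expected behavior is observed in the output.
    return expected_output in result.stdout

# Example: documentation claims "a negative balance prints an overdraft notice".
# passed = verify_claim("acctproc.cbl", "-120.50\n", "OVERDRAFT")
```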
The Silence Penalty penalizes documentation too vague to yield verifiable claims; Critical Failure detection triggers a hard stop when documentation references hallucinated variables or external calls.
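One plausible form of the hallucinated-variable check is a lexical comparison between identifiers mentioned in the documentation and data items actually declared in the program. The sketch below assumes a `WS-` naming convention and uses deliberately simplified regexes; the benchmark's actual detection rules may differ:

```python
# Sketch of one Critical Failure check: flag documentation that references
# data items absent from the program's DATA DIVISION. Regexes are simplified.
import re

def cobol_identifiers(source: str) -> set[str]:
    # Level-numbered data items, e.g. "01 WS-TOTAL" or "05 WS-OVERDRAFT-FEE".
    return {m.group(1).upper()
            for m in re.finditer(r"^\s*\d{2}\s+([A-Z0-9][A-Z0-9-]*)",
                                 source, re.MULTILINE | re.IGNORECASE)}

def hallucinated_variables(doc: str, source: str) -> set[str]:
    # Identifiers the documentation mentions but the program never declares.
    mentioned = set(re.findall(r"\b(WS-[A-Z0-9-]+)\b", doc.upper()))
    return mentioned - cobol_identifiers(source)  # nonempty -> hard stop
```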
T1: 80 tasks (Basic) | T2: 31 tasks (Moderate) | T3: 30 tasks (Complex) | T4: 59 tasks (Enterprise)
Evaluation is 100% deterministic with minimal LLM usage: a frozen TF-IDF vectorizer, deterministic BSM patterns, and a versioned Docker image keep results reproducible across runs.
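A sketch of the frozen-vectorizer idea, assuming scikit-learn: the TF-IDF model is fit once, serialized into the versioned image, and only ever loaded at evaluation time, so similarity scores cannot drift between runs. The artifact name is hypothetical:

```python
# Frozen TF-IDF similarity: load a pre-fit, versioned vectorizer and never
# re-fit it at evaluation time, so scores are stable across runs.
import pickle
from sklearn.metrics.pairwise import cosine_similarity

with open("tfidf_vectorizer.pkl", "rb") as f:   # versioned artifact
    vectorizer = pickle.load(f)                 # fit once, frozen thereafter

def doc_similarity(candidate_doc: str, reference_doc: str) -> float:
    # transform() uses the frozen vocabulary and IDF weights only.
    vectors = vectorizer.transform([candidate_doc, reference_doc])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])
```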
200 programs from 8 public GitHub repositories, 300-3000 LOC each, spanning banking, finance, insurance, and retail domains.
The specialized Legacy Insights (92%) tops the leaderboard, and AWS Transform (88%) outperforms GPT-4o (86%), although general-purpose Claude Sonnet 4 (90%) narrowly surpasses AWS Transform. Domain specialization remains a key advantage for enterprise COBOL comprehension.
Contrary to prior assumptions, models maintain strong performance across complexity tiers (0-9 percentage points of degradation from T1 to T4). Enterprise-scale COBOL is increasingly tractable with specialized approaches.
Claim-based behavioral verification provides objective assessment without circular validation problems. This methodology transfers to other legacy code comprehension tasks.