LegacyCodeBench

Execution-Based Evaluation of AI Comprehension for Legacy COBOL Systems

Nikita & Thiyagarajan M
Kalmantic Labs

January 2026

Abstract

We introduce LegacyCodeBench, a benchmark for evaluating whether AI systems can accurately understand and document legacy COBOL code. Unlike existing benchmarks that test code generation (HumanEval, SWE-bench), LegacyCodeBench tests code comprehension—a critical capability for the $2 trillion legacy modernization industry.

Our key methodological innovation is claim-based behavioral verification: instead of using LLM-as-judge or LLM code regeneration, we extract behavioral claims from AI-generated documentation and verify them by executing the original program. This approach provides objective verification while remaining 100% deterministic.

LegacyCodeBench comprises 200 real-world COBOL programs across 4 complexity tiers. Evaluating 5 state-of-the-art models, we find specialized COBOL models (Legacy Insights, AWS Transform) achieve 88-92% on the benchmark, while general-purpose models (Claude Sonnet 4, GPT-4o) achieve 86-90%. Models maintain strong performance across complexity tiers (0-9% degradation).

At a glance: 200 real-world COBOL programs | 92% best model performance | 100% deterministic evaluation | 0-9% performance degradation (T1→T4)

Model Performance Results

Model | Provider | LCB Score | SC | DQ | BF
Legacy Insights | Hexaview | 92% | 94% | 96% | 90%
Claude Sonnet 4 | Anthropic | 90% | 96% | 78% | 91%
AWS Transform | AWS | 88% | 98% | 68% | 91%
IBM Granite 13B | IBM | 87% | 93% | 72% | 90%
GPT-4o | OpenAI | 86% | 92% | 71% | 89%

SC = Structural Completeness (30%) | DQ = Documentation Quality (20%) | BF = Behavioral Fidelity (50%)
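Assuming the LCB score is simply the weighted sum of the three tracks, the table is self-consistent; for example, Legacy Insights: 0.30 × 94% + 0.20 × 96% + 0.50 × 90% ≈ 92%.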

Complexity Tier Analysis

Model | T1 (Basic) | T4 (Enterprise) | Degradation
Legacy Insights | 96% | 90% | 6%
Claude Sonnet 4 | 92% | 92% | 0%
AWS Transform | 88% | 87% | 1%
IBM Granite 13B | 89% | 84% | 5%
GPT-4o | 91% | 82% | 9%

Key Finding: All models maintain strong performance across complexity tiers. Specialized COBOL models demonstrate robustness to enterprise complexity.

Benchmark Design

📊 Three-Track Scoring

Structural Completeness (30%): All required sections present | Documentation Quality (20%): Content accuracy | Behavioral Fidelity (50%): Execution-based verification
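A minimal sketch of how the three-track weighting combines into a single score; the function name and dictionary keys are illustrative, not the benchmark's actual API.

```python
# Minimal sketch: three-track composite scoring, assuming the published
# weights (SC 30%, DQ 20%, BF 50%) combine as a simple weighted average.
WEIGHTS = {"structural_completeness": 0.30,
           "documentation_quality": 0.20,
           "behavioral_fidelity": 0.50}

def lcb_score(track_scores: dict[str, float]) -> float:
    """Combine per-track scores (0-100) into a single LCB score."""
    return sum(WEIGHTS[track] * track_scores[track] for track in WEIGHTS)

# Example: Claude Sonnet 4's published track scores reproduce its LCB score.
print(lcb_score({"structural_completeness": 96,
                 "documentation_quality": 78,
                 "behavioral_fidelity": 91}))   # ~89.9, reported as 90%
```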

✅ Claim-Based Verification

Extract behavioral claims from AI-generated documentation and verify by executing the original COBOL program. Completely deterministic, no LLM-as-judge.
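A minimal sketch of this verification loop, assuming extracted claims are normalized to input/expected-output pairs and the original program is compiled and run with GnuCOBOL (cobc); the claim schema and harness details here are illustrative assumptions, not the benchmark's actual pipeline.

```python
# Sketch: verify behavioral claims by executing the original COBOL program.
import subprocess
from dataclasses import dataclass

@dataclass
class BehavioralClaim:
    stdin_data: str        # input fed to the program (assumed claim format)
    expected_stdout: str   # behavior the documentation asserts

def compile_cobol(source: str, binary: str = "./program") -> str:
    # GnuCOBOL: compile the original program once, unmodified.
    subprocess.run(["cobc", "-x", "-o", binary, source], check=True)
    return binary

def verify_claim(binary: str, claim: BehavioralClaim) -> bool:
    # Execute the original program and compare observed vs. claimed behavior.
    result = subprocess.run([binary], input=claim.stdin_data,
                            capture_output=True, text=True, timeout=30)
    return claim.expected_stdout.strip() in result.stdout

def behavioral_fidelity(source: str, claims: list[BehavioralClaim]) -> float:
    binary = compile_cobol(source)
    verified = sum(verify_claim(binary, c) for c in claims)
    return 100.0 * verified / max(len(claims), 1)
```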

🛡️ Anti-Gaming Mechanisms

A silence penalty discourages vague, noncommittal documentation. Critical-failure detection triggers a hard stop when documentation references hallucinated variables or external calls.
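An illustrative sketch of both mechanisms; the claim-count threshold and the hyphenated-identifier heuristic are assumptions, not the benchmark's actual rules.

```python
# Sketch: silence penalty and critical-failure (hallucination) detection.
import re

# Hyphenated COBOL keywords that should not count as program identifiers.
RESERVED = {"END-IF", "END-PERFORM", "END-EVALUATE", "WORKING-STORAGE"}

def silence_penalty(num_claims: int, program_loc: int,
                    expected_per_100_loc: float = 1.0) -> float:
    """Penalty for vague documentation: 0.0 when it makes enough verifiable
    claims for the program's size, rising to 1.0 when it stays silent."""
    expected = max(expected_per_100_loc * program_loc / 100, 1.0)
    return 1.0 - min(num_claims / expected, 1.0)

def has_critical_failure(doc_text: str, cobol_source: str) -> bool:
    """Hard stop when documentation mentions COBOL-style identifiers
    (e.g., WS-BALANCE) that never appear in the original program."""
    ids = lambda text: set(re.findall(r"\b[A-Z][A-Z0-9]*(?:-[A-Z0-9]+)+\b",
                                      text.upper()))
    return bool(ids(doc_text) - ids(cobol_source) - RESERVED)
```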

🎯 4 Complexity Tiers

T1: 80 tasks (Basic) | T2: 31 tasks (Moderate) | T3: 30 tasks (Complex) | T4: 59 tasks (Enterprise)

🔒 Reproducibility

100% deterministic evaluation with minimal LLM usage: frozen TF-IDF vectorizer, deterministic BSM patterns, and a versioned Docker image.
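A sketch of how a frozen vectorizer keeps scoring deterministic, assuming the Documentation Quality track uses TF-IDF cosine similarity against reference documentation; the artifact path is illustrative.

```python
# Sketch: load a TF-IDF vectorizer that was fit once and shipped with the
# benchmark, so evaluation never re-fits anything at scoring time.
import pickle
from sklearn.metrics.pairwise import cosine_similarity

with open("artifacts/tfidf_vectorizer_v1.pkl", "rb") as f:
    vectorizer = pickle.load(f)   # frozen: never re-fit during evaluation

def dq_similarity(candidate_doc: str, reference_doc: str) -> float:
    vectors = vectorizer.transform([candidate_doc, reference_doc])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])
```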

📈 Real-World Data

200 programs from 8 public GitHub repositories, 300-3000 LOC each, spanning banking, finance, insurance, and retail domains.

Key Findings

Specialized Models Outperform

Legacy Insights (92%) tops the leaderboard and AWS Transform (88%) edges out GPT-4o (86%); as a group, specialized COBOL models (88-92%) sit above the general-purpose range (86-90%). Domain specialization is key to enterprise COBOL comprehension.

No Complexity Wall

Contrary to prior assumptions, models maintain strong performance across complexity tiers (0-9% degradation from T1 to T4). Enterprise-scale COBOL is increasingly solvable with specialized approaches.

Execution-Based Evaluation Works

Claim-based behavioral verification provides objective assessment without circular validation problems. This methodology transfers to other legacy code comprehension tasks.

Cite this paper
@article{legacycodebench2026,
  title   = {LegacyCodeBench: Execution-Based Evaluation of AI Comprehension for Legacy COBOL Systems},
  author  = {Nikita and Maruthavanan, Thiyagarajan},
  year    = {2026},
  month   = {January},
  url     = {https://github.com/kalmantic/legacycodebench},
  journal = {arXiv}
}