LegacyCodeBench

Execution-Based Evaluation of AI Comprehension for Legacy COBOL Systems

Nikita & Thiyagarajan M
Kalmantic Labs

January 2026

Abstract

We introduce LegacyCodeBench, a benchmark for evaluating whether AI systems can accurately understand and document legacy COBOL code. Unlike existing benchmarks that test code generation (HumanEval, SWE-bench), LegacyCodeBench tests code comprehension—a critical capability for the $2 trillion legacy modernization industry.

Our key methodological innovation is claim-based behavioral verification: instead of using LLM-as-judge or LLM code regeneration, we extract behavioral claims from AI-generated documentation and verify them by executing the original program. This approach provides objective verification while remaining 100% deterministic.

LegacyCodeBench comprises 200 real-world COBOL programs across 4 complexity tiers. Evaluating 5 state-of-the-art models, we find specialized COBOL models (Legacy Insights, AWS Transform) achieve 88-92% on the benchmark, while general-purpose models (Claude Sonnet 4, GPT-4o) achieve 86-90%. Models maintain strong performance across complexity tiers (0-9% degradation).

At a glance: 200 real-world COBOL programs | 92% best model performance | 100% deterministic evaluation | 0-9% performance degradation (T1→T4)

Model Performance Results

Model | Provider | LCB Score | SC | DQ | BF
Legacy Insights | Hexaview | 92% | 94% | 96% | 90%
Claude Sonnet 4 | Anthropic | 90% | 96% | 78% | 91%
AWS Transform | AWS | 88% | 98% | 68% | 91%
IBM Granite 13B | IBM | 87% | 93% | 72% | 90%
GPT-4o | OpenAI | 86% | 92% | 71% | 89%

SC = Structural Completeness (30%) | DQ = Documentation Quality (20%) | BF = Behavioral Fidelity (50%)
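Assuming the LCB score is simply the weighted sum of the three tracks, the table is self-consistent; for example, Legacy Insights: 0.30 × 94% + 0.20 × 96% + 0.50 × 90% ≈ 92%.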

Complexity Tier Analysis

Model | T1 (Basic) | T4 (Enterprise) | Degradation
Legacy Insights | 96% | 90% | 6%
Claude Sonnet 4 | 92% | 92% | 0%
AWS Transform | 88% | 87% | 1%
IBM Granite 13B | 89% | 84% | 5%
GPT-4o | 91% | 82% | 9%

Key Finding: All models maintain strong performance across complexity tiers. Specialized COBOL models demonstrate robustness to enterprise complexity.

Benchmark Design

📊 Three-Track Scoring

Structural Completeness (30%): All required sections present | Documentation Quality (20%): Content accuracy | Behavioral Fidelity (50%): Execution-based verification
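A minimal sketch of how the three-track weighting combines into a single score; the function name and dictionary keys are illustrative, not the benchmark's actual API.

```python
# Minimal sketch: three-track composite scoring, assuming the published
# weights (SC 30%, DQ 20%, BF 50%) combine as a simple weighted average.
WEIGHTS = {"structural_completeness": 0.30,
           "documentation_quality": 0.20,
           "behavioral_fidelity": 0.50}

def lcb_score(track_scores: dict[str, float]) -> float:
    """Combine per-track scores (0-100) into a single LCB score."""
    return sum(WEIGHTS[track] * track_scores[track] for track in WEIGHTS)

# Example: Claude Sonnet 4's published track scores reproduce its LCB score.
print(lcb_score({"structural_completeness": 96,
                 "documentation_quality": 78,
                 "behavioral_fidelity": 91}))   # ~89.9, reported as 90%
```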

✅ Claim-Based Verification

Extract behavioral claims from AI-generated documentation and verify by executing the original COBOL program. Completely deterministic, no LLM-as-judge.
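A minimal sketch of this verification loop, assuming extracted claims are normalized to input/expected-output pairs and the original program is compiled and run with GnuCOBOL (cobc); the claim schema and harness details here are illustrative assumptions, not the benchmark's actual pipeline.

```python
# Sketch: verify behavioral claims by executing the original COBOL program.
import subprocess
from dataclasses import dataclass

@dataclass
class BehavioralClaim:
    stdin_data: str        # input fed to the program (assumed claim format)
    expected_stdout: str   # behavior the documentation asserts

def compile_cobol(source: str, binary: str = "./program") -> str:
    # GnuCOBOL: compile the original program once, unmodified.
    subprocess.run(["cobc", "-x", "-o", binary, source], check=True)
    return binary

def verify_claim(binary: str, claim: BehavioralClaim) -> bool:
    # Execute the original program and compare observed vs. claimed behavior.
    result = subprocess.run([binary], input=claim.stdin_data,
                            capture_output=True, text=True, timeout=30)
    return claim.expected_stdout.strip() in result.stdout

def behavioral_fidelity(source: str, claims: list[BehavioralClaim]) -> float:
    binary = compile_cobol(source)
    verified = sum(verify_claim(binary, c) for c in claims)
    return 100.0 * verified / max(len(claims), 1)
```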

🛡️ Anti-Gaming Mechanisms

A silence penalty discourages vague, noncommittal documentation. Critical-failure detection triggers a hard stop when documentation references hallucinated variables or external calls.
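An illustrative sketch of both mechanisms; the claim-count threshold and the hyphenated-identifier heuristic are assumptions, not the benchmark's actual rules.

```python
# Sketch: silence penalty and critical-failure (hallucination) detection.
import re

# Hyphenated COBOL keywords that should not count as program identifiers.
RESERVED = {"END-IF", "END-PERFORM", "END-EVALUATE", "WORKING-STORAGE"}

def silence_penalty(num_claims: int, program_loc: int,
                    expected_per_100_loc: float = 1.0) -> float:
    """Penalty for vague documentation: 0.0 when it makes enough verifiable
    claims for the program's size, rising to 1.0 when it stays silent."""
    expected = max(expected_per_100_loc * program_loc / 100, 1.0)
    return 1.0 - min(num_claims / expected, 1.0)

def has_critical_failure(doc_text: str, cobol_source: str) -> bool:
    """Hard stop when documentation mentions COBOL-style identifiers
    (e.g., WS-BALANCE) that never appear in the original program."""
    ids = lambda text: set(re.findall(r"\b[A-Z][A-Z0-9]*(?:-[A-Z0-9]+)+\b",
                                      text.upper()))
    return bool(ids(doc_text) - ids(cobol_source) - RESERVED)
```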

🎯 4 Complexity Tiers

T1: 80 tasks (Basic) | T2: 31 tasks (Moderate) | T3: 30 tasks (Complex) | T4: 59 tasks (Enterprise)

🔒 Reproducibility

100% deterministic evaluation with minimal LLM usage: frozen TF-IDF vectorizer, deterministic BSM patterns, and a versioned Docker image.
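A sketch of how a frozen vectorizer keeps scoring deterministic, assuming the Documentation Quality track uses TF-IDF cosine similarity against reference documentation; the artifact path is illustrative.

```python
# Sketch: load a TF-IDF vectorizer that was fit once and shipped with the
# benchmark, so evaluation never re-fits anything at scoring time.
import pickle
from sklearn.metrics.pairwise import cosine_similarity

with open("artifacts/tfidf_vectorizer_v1.pkl", "rb") as f:
    vectorizer = pickle.load(f)   # frozen: never re-fit during evaluation

def dq_similarity(candidate_doc: str, reference_doc: str) -> float:
    vectors = vectorizer.transform([candidate_doc, reference_doc])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])
```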

📈 Real-World Data

200 programs from 8 public GitHub repositories, 300-3000 LOC each, spanning banking, finance, insurance, and retail domains.

Key Findings

Specialized Models Outperform

Legacy Insights (92%) tops the leaderboard and AWS Transform (88%) edges out GPT-4o (86%); as a group, specialized COBOL models (88-92%) sit above the general-purpose range (86-90%). Domain specialization is key to enterprise COBOL comprehension.

No Complexity Wall

Contrary to prior assumptions, models maintain strong performance across complexity tiers (0-9% degradation from T1 to T4). Enterprise-scale COBOL is increasingly solvable with specialized approaches.

Execution-Based Evaluation Works

Claim-based behavioral verification provides objective assessment without circular validation problems. This methodology transfers to other legacy code comprehension tasks.

Cite this paper
@article{legacycodebench2026,
  title   = {LegacyCodeBench: Execution-Based Evaluation of AI Comprehension for Legacy COBOL Systems},
  author  = {Nikita and Maruthavanan, Thiyagarajan},
  year    = {2026},
  month   = {January},
  url     = {https://github.com/kalmantic/legacycodebench},
  journal = {arXiv}
}