LegacySWE

Frontier coding systems still fall materially short when applied to enterprise software engineering in low-resource languages.

Critical systems are still written and maintained in languages such as COBOL, Fortran, and Assembly. In these environments, the central difficulty is not syntax. It is latent semantics: business logic distributed across programs, record layouts, batch jobs, routing logic, status codes, interfaces, conventions, and historical edits.

LegacySWE is being built to study this capability regime directly.

The Gap

The gap is defined by the intersection of three conditions.

  1. The language is low-resource.
  2. The system is enterprise-critical.
  3. The work is maintenance or modernization rather than greenfield generation.

Each condition narrows the margin for error. Low-resource languages provide weaker priors and thinner tooling. Enterprise systems encode correctness in weakly declared invariants rather than explicit specifications. Maintenance work requires controlled modification of an existing system rather than unconstrained synthesis from scratch.

Taken together, these conditions reveal a capability deficit in current frontier systems. A model may produce code that is locally plausible, syntactically valid, or even superficially well reasoned, while still failing to recover the hidden contract that governs the system. A patch can look locally correct and still break the surrounding enterprise behavior. Progress that appears strong on popular languages and cleaner tasks does not transfer reliably into this regime.
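A minimal sketch of how a locally correct patch can break a hidden contract. Everything here is hypothetical and chosen only for illustration: the record layout, the field widths, and the function names `write_record` and `read_status` are assumptions, not taken from any real system or from the benchmark. Python stands in for the legacy language so the example stays short.

```python
# Hypothetical fixed-width record: 6-char account id, 10-char name, 2-char status.
# The layout is never declared in one place; each program simply assumes it.

def write_record(account: str, name: str, status: str, name_width: int = 10) -> str:
    """Writer program: pads or truncates the name to name_width columns."""
    return f"{account:<6}{name:<{name_width}.{name_width}}{status:<2}"


def read_status(record: str) -> str:
    """Downstream job written separately: assumes status sits at columns 16-17."""
    return record[16:18]


original = write_record("A00042", "SMITH", "OK")

# The "patch": widen the name field to 12 columns to fit longer names.
patched = write_record("A00042", "SMITH", "OK", name_width=12)

# A check local to the writer still passes after the patch.
local_check_passes = patched.startswith("A00042SMITH")

# The downstream reader now sees padding where the status code used to be.
downstream_original = read_status(original)
downstream_patched = read_status(patched)

print(local_check_passes, repr(downstream_original), repr(downstream_patched))
```

The writer's own check passes, yet the reader, which was never touched, silently reads blanks in place of the status code. The contract lived in the column offsets shared between the two programs, not in either program alone.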

Introducing LegacySWE

We believe current frontier models materially underperform in long-lived enterprise systems.

Our view is that this underperformance is not explained primarily by pretraining scarcity in low-resource languages. The deeper limitation lies in a system's ability to recover implicit contracts, fragmented business logic, and distributed operational semantics from real enterprise codebases.

LegacySWE is Metaphi's research effort to help coding agents hill-climb on the hardest maintenance and modernization tasks in legacy enterprise systems. We are acquiring proprietary enterprise codebases and working with domain experts, many of whom have more than twenty years of experience in these systems, to construct representative tasks and environments for measuring frontier-model capability in this regime.

First Release: COBOLBench

The first public release under LegacySWE is COBOLBench, a benchmark for frontier coding agents on enterprise COBOL maintenance.

COBOLBench evaluates whether an agent can complete maintenance work inside a real enterprise-style COBOL environment under automated correctness checks. It is designed to test something stricter than whether a model can generate plausible COBOL. The target is not valid COBOL in isolation, but edits that preserve latent business behavior.

On the current 100-task public release, the best evaluated system reaches 11% Pass@4. Only 12 of the 100 tasks are solved by any evaluated system; the remaining 88 are solved by none.
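For readers unfamiliar with the metric, here is a hedged sketch of how a Pass@4 figure of this kind could be computed. This post does not define Pass@4, so the sketch assumes the commonly used unbiased pass@k estimator, under which a task with 4 attempts counts toward Pass@4 if at least one attempt passes. The per-task pass counts are hypothetical, chosen only to reproduce an 11% score; they are not the benchmark's actual results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n attempts, c of them passing."""
    if n - c < k:
        # Fewer than k failing attempts exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(passes_per_task: list[int], n: int, k: int) -> float:
    """Average the per-task estimate over all tasks in the benchmark."""
    return sum(pass_at_k(n, c, k) for c in passes_per_task) / len(passes_per_task)


# Hypothetical data: 100 tasks, 4 attempts each.
# 11 tasks have one passing attempt; the other 89 have none.
passes = [1] * 11 + [0] * 89

score = benchmark_pass_at_k(passes, n=4, k=4)
print(f"Pass@4 = {score:.0%}")
```

Under this assumption, with exactly 4 attempts per task the estimator reduces to the fraction of tasks with at least one passing attempt. The 12-of-100 figure is a different quantity: it counts tasks solved by at least one of the evaluated systems, which is why it can exceed the best single system's score.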

Read the full benchmark release here: evals.metaphi.ai/cobolbench/blog.

Contribute

We invite the community to contribute to LegacySWE and help shape the future of agentic coding in enterprises.

  1. Researchers who want to study harder capability surfaces for coding agents and post-training.
  2. Enterprises that want to submit agent entries or evaluate frontier systems against this regime.
  3. Domain experts who wish to collaborate on task construction, review, and validation.

Reach us at agent@metaphi.ai.