COBOL to Kotlin via Formal Models (IR and Alloy and Golden Master)


My First Experiment in Verifiable Modernization

By Marco Graziano

For decades, COBOL systems have powered the financial and administrative core of institutions worldwide. Though over six decades old, the language remains the invisible backbone of modern civilization.

In banking alone, an estimated 43% of all core systems in the US are still built on COBOL, powering roughly 95% of ATM transactions and 80% of in-person financial services, processing an astonishing $3 trillion in daily commerce.

The insurance sector is equally dependent: about 67% of US top-tier firms continue to rely on COBOL for critical operations such as policy management, billing, and claims.

Even in healthcare, where newer systems have made inroads, legacy COBOL applications still handle administrative functions for millions — some serving over 60 million patients daily. This enduring reliance exists not out of nostalgia, but necessity: COBOL’s reliability, stability, and the prohibitive cost and risk of replacing decades of deeply integrated logic make it one of the most mission-critical technologies ever built.

This persistent reliance on COBOL highlights a simple truth: the systems that move our money, insure our lives, and manage our hospitals cannot afford experiments — they need modernization grounded in proof, not hope.

Yet most modernization projects still rely on translation, not understanding. I wanted to explore something different: a reproducible, verifiable path from COBOL to modern code, grounded in formal semantics rather than pattern-based rewriting.

This post documents my first working prototype, a complete modernization playbook that takes a simple COBOL batch program, produces an intermediate representation (IR), transforms it into formal models, generates Kotlin code, and finally proves functional equivalence against the legacy output.

Why Modernization Needs to Be Formal

Legacy modernization is often treated like linguistic translation: a line of COBOL becomes a line of Java. But translation without semantics is brittle. What if we could instead reconstruct the meaning of the system? That’s where formal methods come in.

By expressing a system’s invariants and relationships in formal languages such as Alloy, TLA+, or SMT (Z3), we can verify that the modern implementation is not just similar, but provably equivalent.

In this approach, modernization becomes a semantic reconstruction process:

  1. Extract the behavior (Intermediate Representation).
  2. Model it formally.
  3. Regenerate it in a modern language.
  4. Compare outputs to confirm equivalence.
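The four steps above can be sketched as a single pipeline. This is a hypothetical sketch: the function names and return shapes are illustrative stubs, not the repository’s actual API.

```python
# Hypothetical sketch of the four-stage process; function names and
# return shapes are illustrative, not the repository's actual API.

def extract_ir(cobol_source: str) -> dict:
    # Step 1: parse COBOL into an intermediate representation (stubbed).
    return {"program": "DAILYPOST", "paragraphs": ["MAIN", "APPLY-TODAYS-TXNS"]}

def model_formally(ir: dict) -> dict:
    # Step 2: derive formal invariants from the IR (stubbed).
    return {"invariants": ["every transaction belongs to exactly one account"]}

def regenerate(ir: dict) -> str:
    # Step 3: emit modern code from the IR (stubbed).
    return f"// Kotlin runner for {ir['program']}"

def outputs_match(legacy_out: bytes, modern_out: bytes) -> bool:
    # Step 4: golden-master comparison, byte for byte.
    return legacy_out == modern_out
```

The point of the sketch is the shape of the data flow: every stage consumes the previous stage’s artifact, so each artifact can be inspected and versioned on its own.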

It’s not so much about replacing the past as it is about translating intent.

The Experiment: A COBOL Posting System

The prototype starts with a simple COBOL program, DAILYPOST.cbl, which performs daily account postings.

```cobol
IDENTIFICATION DIVISION.
PROGRAM-ID. DAILYPOST.

ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
FILE-CONTROL.
    SELECT ACCT-FILE ASSIGN TO "data/accounts.dat"
        ORGANIZATION IS LINE SEQUENTIAL.
    SELECT TXN-FILE ASSIGN TO "data/txns.dat"
        ORGANIZATION IS LINE SEQUENTIAL.
    SELECT ACCT-OUT ASSIGN TO "out/accounts_out.dat"
        ORGANIZATION IS LINE SEQUENTIAL.
    SELECT EXC-FILE ASSIGN TO "out/exceptions.dat"
        ORGANIZATION IS LINE SEQUENTIAL.

DATA DIVISION.
FILE SECTION.
FD ACCT-FILE RECORD CONTAINS 58 CHARACTERS.
01 ACCT-IN-REC.
    COPY "ACCOUNT-REC.cpy".

FD TXN-FILE RECORD CONTAINS 72 CHARACTERS.
01 TXN-IN-REC.
    COPY "TXN-REC.cpy".

WORKING-STORAGE SECTION.
77 EOF-ACCT PIC X VALUE "N".
77 EOF-TXN PIC X VALUE "N".
77 NEW-BAL PIC S9(11)V99 COMP-3.
77 TODAY PIC 9(8) VALUE 20231107.

PROCEDURE DIVISION.
MAIN.
    OPEN INPUT ACCT-FILE TXN-FILE
         OUTPUT ACCT-OUT EXC-FILE
    PERFORM UNTIL EOF-ACCT = "Y"
        READ ACCT-FILE INTO ACCT-IN-REC
            AT END MOVE "Y" TO EOF-ACCT
            NOT AT END
                PERFORM APPLY-TODAYS-TXNS
                WRITE ACCT-OUT-REC FROM ACCT-IN-REC
        END-READ
    END-PERFORM
    CLOSE ACCT-FILE TXN-FILE ACCT-OUT EXC-FILE
    STOP RUN.
*> Note: the APPLY-TODAYS-TXNS paragraph and the output record
*> definitions (ACCT-OUT-REC and EXC-FILE's FD) are elided in this excerpt.
```

From there, I built a modernization pipeline that moves through several clear stages:


Each Makefile target corresponds to a transparent phase in the pipeline. The focus is not automation for its own sake but traceability: every artifact is inspectable and reproducible.

The Strategy: From Structure to Semantics

Large-scale modernization begins with a structured intermediate representation (IR). This is not a compiler artifact but rather a conceptual map of the system’s meaning.

The Layered IR

  • Syntactic IR: parses COBOL into paragraphs, sections, data declarations, and copybooks
  • Structural IR: captures record layouts, field types, and file I/O contracts
  • Control-Flow IR (CFG): builds a flow graph of PERFORM, EVALUATE, and IF branches
  • Effects IR: tracks reads/writes to files and variables
  • Semantic IR: expresses the algebra of transformations, e.g. “balance’ = balance - amount”

Each layer adds fidelity without losing context. Even if we stop halfway, the resulting model already serves as documentation, dependency graph, and modernization map.
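To make the layering concrete, here is a hypothetical fragment of what such an IR could look like; the actual schema of parsing/out/ir.json may differ.

```python
# Hypothetical fragment of a layered IR; the actual schema of
# parsing/out/ir.json in the repository may differ.
ir = {
    "syntactic": {"program": "DAILYPOST",
                  "paragraphs": ["MAIN", "APPLY-TODAYS-TXNS"]},
    "structural": {"ACCT-IN-REC": {"copybook": "ACCOUNT-REC.cpy",
                                   "record_length": 58}},
    "control_flow": {"MAIN": ["READ ACCT-FILE",
                              "PERFORM APPLY-TODAYS-TXNS",
                              "WRITE ACCT-OUT-REC"]},
    "effects": {"APPLY-TODAYS-TXNS": {"reads": ["TXN-FILE"],
                                      "writes": ["NEW-BAL", "EXC-FILE"]}},
    # Semantic layer: "balance' = balance - amount" as an executable rule.
    "semantic": {"post_txn": lambda balance, amount: balance - amount},
}

assert ir["semantic"]["post_txn"](1000.00, 250.00) == 750.00
```

Notice that the first four layers are plain data, while the semantic layer is where transformation rules become executable and therefore checkable.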

The Role of Formal Models

⚙️ Alloy

Alloy models static structure — relationships among records, constraints, and invariants.

Example: “Every transaction belongs to exactly one account.”
This is captured in formal/alloy/banking.als and verified using the Alloy Analyzer.
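As a rough illustration of what the constraint means, here is the same invariant stated as a Python runtime check over sample data; the real constraint lives in formal/alloy/banking.als, where the Alloy Analyzer verifies it over all small instances rather than one sample.

```python
# Python sketch of the invariant "every transaction belongs to exactly
# one account". The real version is an Alloy constraint checked over all
# small instances, not just this sample data.
def every_txn_has_one_account(txns, accounts) -> bool:
    # Each transaction names exactly one account, and it must exist.
    return all(txn["account"] in accounts for txn in txns)

accounts = {"ACC001", "ACC002"}
txns = [
    {"txn_id": "T1", "account": "ACC001", "amount": 250.00},
    {"txn_id": "T2", "account": "ACC002", "amount": 75.50},
]
assert every_txn_has_one_account(txns, accounts)
```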

⏱ TLA+

TLA+ expresses temporal behavior — perfect for COBOL’s batch workflows.
It allows us to specify conditions like: “After all transactions are processed, every account balance must be updated exactly once.”
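A concrete (non-TLA+) sketch of that property: check it against a recorded execution trace. A real TLA+ module would state the property over all possible behaviors of the system, not one trace, so treat this as illustration only.

```python
from collections import Counter

# Python sketch of the temporal property "after the batch completes,
# every account balance is updated exactly once". A real TLA+ module
# checks this over all behaviors, not one recorded trace.
def updated_exactly_once(trace, accounts) -> bool:
    updates = Counter(e["account"] for e in trace if e["op"] == "update")
    return all(updates[acct] == 1 for acct in accounts)

good_trace = [{"op": "update", "account": "ACC001"},
              {"op": "update", "account": "ACC002"}]
bad_trace = good_trace + [{"op": "update", "account": "ACC001"}]

assert updated_exactly_once(good_trace, {"ACC001", "ACC002"})
assert not updated_exactly_once(bad_trace, {"ACC001", "ACC002"})
```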

🧮 SMT / Z3

Z3 (Satisfiability Modulo Theories) checks arithmetic and logical consistency. It’s ideal for proving that rewritten arithmetic (for instance, interest or overdraft rules) is equivalent at a symbolic level, not just with sample data.
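To show the flavor of such an equivalence check without requiring Z3 itself, here is a hypothetical overdraft rule written two ways, with an exhaustive check over a small integer domain. A real SMT proof would quantify over all values symbolically; the rule and its constant are invented for illustration.

```python
# Hypothetical overdraft rule in two formulations. A real Z3 query would
# prove equivalence symbolically over all values; this sketch only
# exhausts a small integer domain to illustrate the idea.
OVERDRAFT_LIMIT = 500

def legacy_rule(balance: int, amount: int) -> bool:
    # COBOL-style: reject when posting would exceed the overdraft limit.
    return balance - amount < -OVERDRAFT_LIMIT

def modern_rule(balance: int, amount: int) -> bool:
    # Rewritten form: the same predicate, algebraically rearranged.
    return amount > balance + OVERDRAFT_LIMIT

# Exhaustive agreement over a small domain (an SMT solver would prove
# it for all integers at once).
assert all(
    legacy_rule(b, a) == modern_rule(b, a)
    for b in range(-510, 511)
    for a in range(0, 1021)
)
```

The symbolic version is stronger: Z3 would either confirm the two predicates agree everywhere or return a concrete counterexample input.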

The Golden Master Approach

A Golden Master harness checks behavioral equivalence between the COBOL and modern versions against fixed inputs.

  1. Both systems run with the same fixed input files (data/accounts.dat, data/txns.dat).
  2. Each writes its results to out/accounts_out_*.dat.
  3. Python scripts convert fixed-width output to CSV and compute SHA-256 checksums.
  4. If the hashes match, the two runs produced byte-for-byte identical output for that input.

This process is defined by:

  • scripts/compare_runs.py — performs record-by-record comparisons
  • tests/artifacts/ — stores all test outputs for reproducibility
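The checksum step can be sketched in a few lines. The fixed-width field layout below is a hypothetical stand-in, and the real scripts/compare_runs.py also reports record-by-record differences rather than just a hash verdict.

```python
import hashlib

# Minimal sketch of the golden-master checksum comparison. The
# fixed-width layout (10-char account id, then an amount field) is
# hypothetical; the real scripts/compare_runs.py also diffs records.
def fixed_width_to_csv(line: str) -> str:
    return f"{line[:10].strip()},{line[10:].strip()}"

def sha256_of(lines) -> str:
    digest = hashlib.sha256()
    for line in lines:
        digest.update(fixed_width_to_csv(line).encode("utf-8"))
    return digest.hexdigest()

legacy = ["ACC0000001      750.00"]
modern = ["ACC0000001      750.00"]
assert sha256_of(legacy) == sha256_of(modern)
```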

The Modern Runner in Kotlin

Instead of blindly translating COBOL to Java, I wrote a clean, idiomatic Kotlin equivalent.

This “modern runner” lives under rewrite/kotlin and is managed by Gradle. It’s designed to read the same fixed-width data, apply the same transformations, and produce the same results.

To run it:

cd rewrite/kotlin
env -u JAVA_TOOL_OPTIONS ./gradlew run

Results appear under out/accounts_out_modern.dat, which can be directly diffed against the COBOL version.

Repository Structure

Here’s how the repository is organized:

cobol-modernization-playbook/
├── cobol/ # Legacy COBOL source and copybooks
├── data/ # Input datasets (fixed-width)
├── parsing/out/ir.json # Intermediate Representation (IR)
├── formal/alloy/ # Alloy model of core invariants
├── rewrite/kotlin/ # Modernized Kotlin codebase
├── scripts/ # Supporting Python tooling
│ ├── parse_stub.py
│ ├── ir_to_alloy.py
│ ├── ir_to_kotlin.py
│ ├── compare_runs.py
│ └── modern_runner.py
├── docs/ # Diagrams and documentation
│ ├── diagram-architecture.png
│ ├── diagram-sequence.png
│ └── COBOL_Reverse_Engineering_Strategy.md
├── tests/artifacts/ # Golden master outputs
└── Makefile # One-liner build targets for the full flow

The result is a miniature modernization lab — you can read, run, and extend it without vendor tools; I used GnuCOBOL to compile and run the legacy side of this experiment.

Next Steps

This project is only the beginning. My goal is to generalize the process for large-scale COBOL systems with complex batch and transaction logic, the kind that still powers banks, insurers, and government agencies worldwide.

I would like to collaborate on a real modernization effort, where legacy reliability meets formal verification and modern architectures. If you’re working on such a system or planning a migration, I’d be eager to contribute this formal, verifiable approach to help bridge the gap between mainframe dependability and modern software engineering.

Immediate roadmap:

  • Extend the COBOL parser to handle PERFORM THRU, nested conditionals, and copybook redefinitions.
  • Generate more expressive Alloy models (e.g., “no balance below -overdraft”).
  • Add TLA+ modules to verify batch ordering and liveness properties.
  • Eventually evolve the modern Kotlin runner into a REST or gRPC microservice — still semantically verifiable.

Why This Matters

Modernization is not just about replacing old technology. It’s about preserving institutional knowledge and ensuring correctness at scale.

By combining formal verification and golden master equivalence, we move from risky rewrites to provable modernization, a process that respects both legacy systems and modern engineering standards.

Final Thoughts

Modernization is not just a code migration problem; it’s an epistemic one.
We’re not rewriting syntax; we’re reconstructing truth about what the system does, why it does it, and how to ensure it keeps doing so.

If modernization becomes verifiable, it also becomes trustworthy.
That’s how we finally move beyond decades of “legacy debt” and make modernization a scientific, repeatable process.
