boundary-governed transformer training
TransXform interposes a supervisory authority between the optimizer and the model. Gradient updates are provisional state changes subject to post-commit correction. Illegal states cannot persist.
Five non-negotiable constraints, enforced every step: cosine collapse, gradient death, variance collapse, loss explosion, and gradient spikes. Illegal states are corrected immediately.
13 diagnostic signals that detect subtle pathologies: unused capacity, shortcut learning, threshold drift, intervention futility, gradient domination, overfitting, and more.
Training phases with readiness gates. The model must prove competence before advancing. No more "let it cook" while the model is silently dying.
Reinitialize collapsed submodules. Rescale dying gradients. Dampen gradient spikes. All while preserving healthy learned structure elsewhere.
Every invariant, threshold, phase, and intervention rule is declared in a YAML spec. Fully auditable, version-controlled, reproducible.
Optional TUI showing all metrics, violations, and interventions in real time. Cryptographic audit trail via Merkle tree for full reproducibility.
# TransXform training spec
invariants:
  cosine_collapse:
    threshold: 0.95
    window: 100
    action: reinitialize
  gradient_death:
    threshold: 1e-7
    consecutive_steps: 50
    action: rescale
  variance_collapse:
    min_variance: 0.01
    action: perturb

phases:
  - name: warmup
    max_steps: 2000
    gate:
      loss_below: 4.0
      no_violations_for: 200
  - name: core_training
    max_steps: 50000
    gate:
      loss_below: 2.5
      capacity_utilization_above: 0.7

witness:
  enabled: true
  refresh_ms: 250

merkle_audit:
  enabled: true
  output: ./audit/run_{timestamp}.merkle
Each invariant defines a threshold, a detection window, and an automatic intervention. When cosine similarity exceeds 0.95 across representations, the supervisor reinitializes the collapsed component. No human required.
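To make the detection shape concrete, here is a minimal sketch of a windowed cosine-collapse check. The names (`CosineCollapse`, `observe`) are invented for illustration and are not the actual TransXform API; only the 0.95 threshold and 100-step window come from the spec above.

```rust
/// Illustrative sketch (not the real TransXform type): flag a collapse
/// when the mean pairwise cosine similarity across component
/// representations stays above a threshold for a full window of steps.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb + 1e-12)
}

struct CosineCollapse {
    threshold: f32, // e.g. 0.95, from the spec
    window: usize,  // e.g. 100 consecutive steps
    hits: usize,    // steps in a row above threshold
}

impl CosineCollapse {
    fn new(threshold: f32, window: usize) -> Self {
        Self { threshold, window, hits: 0 }
    }

    /// Returns true when the invariant is violated and the configured
    /// intervention (here: reinitialize) should fire.
    fn observe(&mut self, reprs: &[Vec<f32>]) -> bool {
        let mut sims = Vec::new();
        for i in 0..reprs.len() {
            for j in (i + 1)..reprs.len() {
                sims.push(cosine_similarity(&reprs[i], &reprs[j]));
            }
        }
        let mean = sims.iter().sum::<f32>() / sims.len().max(1) as f32;
        if mean > self.threshold { self.hits += 1 } else { self.hits = 0 }
        self.hits >= self.window
    }
}
```

The windowed streak matters: a single high-similarity step is noise, but a full window of them is a collapse.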
The model cannot advance from warmup to core training until it proves competence: loss below target AND no invariant violations for 200 consecutive steps. Hope is not a strategy.
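The gate logic itself is small. This is a hedged sketch with invented names (`PhaseGate`, `may_advance` are not the real API); the 4.0 loss target and 200-step clean window are the warmup values from the spec.

```rust
/// Illustrative readiness gate: a phase may advance only when loss is
/// below the target AND no invariant has fired for a run of steps.
struct PhaseGate {
    loss_below: f64,
    no_violations_for: u64,
}

impl PhaseGate {
    /// `last_violation_step` is None if the phase has been clean so far.
    fn may_advance(&self, loss: f64, step: u64, last_violation_step: Option<u64>) -> bool {
        let clean_steps = match last_violation_step {
            None => step,
            Some(v) => step.saturating_sub(v),
        };
        loss < self.loss_below && clean_steps >= self.no_violations_for
    }
}
```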
Every supervisor decision — every violation detected, every intervention applied, every phase gate evaluated — is recorded in a Merkle-chained log. Cryptographic proof of what happened and why.
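The tamper-evidence idea can be shown with a toy hash chain. This is an illustration only: the real audit trail is a Merkle tree, and a real implementation would use a cryptographic hash such as SHA-256, not `DefaultHasher`. All names here are assumptions, not the TransXform API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy hash-chained audit log: each entry's digest commits to the
/// previous digest, so editing any past entry changes every later one.
struct AuditLog {
    head: u64,                  // digest of the latest entry
    entries: Vec<(u64, String)>,
}

impl AuditLog {
    fn new() -> Self {
        Self { head: 0, entries: Vec::new() }
    }

    fn record(&mut self, event: &str) -> u64 {
        let mut h = DefaultHasher::new();
        self.head.hash(&mut h);
        event.hash(&mut h);
        self.head = h.finish();
        self.entries.push((self.head, event.to_string()));
        self.head
    }

    /// Recompute the chain from scratch and compare against stored digests.
    fn verify(&self) -> bool {
        let mut prev = 0u64;
        for (digest, event) in &self.entries {
            let mut h = DefaultHasher::new();
            prev.hash(&mut h);
            event.hash(&mut h);
            prev = h.finish();
            if prev != *digest {
                return false;
            }
        }
        true
    }
}
```

Any retroactive edit to a logged violation or intervention fails verification, which is what makes the trail a proof rather than a diary.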
The stability loss term was rewarding the model for producing identical outputs regardless of input. Loss curve looked healthy. The model was learning to do nothing, confidently.
The emission heads collapsed to a single degenerate point. Zero gradient. The optimizer saw convergence. The model had lost the ability to produce varied outputs.
The recurrent pathway was overwriting the input signal entirely. Each recurrence step made the representation less informative. 14,000 steps of compute, wasted.
Cosine collapse detection catches emission collapse at step 6,000. Variance monitoring catches the trivial convergence at step 4,000. Gradient death detection catches the frozen recurrence. Each violation triggers an immediate, targeted intervention. The training self-corrects.
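The gradient-death check that catches the frozen recurrence has the same streak structure. A minimal sketch (invented names; the 1e-7 floor and 50-step run are the spec's values):

```rust
/// Illustrative gradient-death check: fire when the global gradient
/// norm stays below a floor for a run of consecutive steps.
struct GradientDeath {
    threshold: f64,
    consecutive_steps: u32,
    streak: u32,
}

impl GradientDeath {
    fn new(threshold: f64, consecutive_steps: u32) -> Self {
        Self { threshold, consecutive_steps, streak: 0 }
    }

    fn observe(&mut self, grads: &[f64]) -> bool {
        let norm = grads.iter().map(|g| g * g).sum::<f64>().sqrt();
        if norm < self.threshold { self.streak += 1 } else { self.streak = 0 }
        self.streak >= self.consecutive_steps
    }
}
```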
Set transformer on 30K AEONCORE reasoning traces. 76,647 steps, 817K training samples.
Discovery: Epoch-boundary representation collapse — three components simultaneously spiked to cosine similarity 0.976–0.999 at steps 25,548 and 51,097. TransXform detected every instance without being programmed to look for this failure mode. Recovery within 100 steps.
24-layer transformer, text→semantic packet generation. Two runs: V1.2 failed, V1.3 succeeded.
The lesson: Run 7 (V1.2) transitioned too early, carrying cosine similarity 0.9975 into a 0.98 threshold: an immediate violation cascade, then death. Run 8 (V1.3)'s readiness gate held the model 300 extra steps. Zero violations. "You didn't change the model. We changed the timing of pressure."
Intentionally pathological: 192→32 bottleneck, 3 competing heads, 400:1 gradient ratio. Designed to be unsalvageable.
Negative capability: 29 violations, 12 interventions exhausted, clean abort at step 221. Signal 10 (InterventionFutility) predicted the abort 76 steps early. Signal 11 identified a 102x gradient-domination ratio.
2-layer transformer with gradual label leak — slow poison ramping 0→5.0 over 15K steps. Zero hard violations. V1 sees nothing wrong.
V2 catches it: Step 7,900 — ShortcutLearning advisory fires. "Loss improved 37.7% while activation variance increased 63%." Variance eventually exploded to 57x baseline. Loss hit 0.000056. The model fully memorized the shortcut.
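The shape of that advisory is a correlation check, sketched below with illustrative names and thresholds (only the 37.7% / 63% figures come from the run above; everything else is an assumption, not the real Signal implementation):

```rust
/// Illustrative shortcut-learning heuristic: loss improving sharply
/// while activation variance blows up suggests the model found a
/// degenerate shortcut rather than a real solution.
fn shortcut_advisory(
    loss_start: f64, loss_now: f64,
    var_start: f64, var_now: f64,
    min_loss_drop: f64,  // e.g. 0.30 = 30% improvement over the window
    max_var_growth: f64, // e.g. 0.50 = 50% variance growth
) -> bool {
    let loss_drop = (loss_start - loss_now) / loss_start;
    let var_growth = (var_now - var_start) / var_start;
    loss_drop > min_loss_drop && var_growth > max_var_growth
}
```

Neither quantity alone is alarming; it is the pairing of rapid improvement with variance blow-up that fires the advisory.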
Clean architecture on progressively noisy data (0%→90% label noise over 15K steps). Gradients stay healthy the entire time. Loss plateaus. Standard tooling sees nothing wrong — gradients are flowing, loss isn't exploding.
Signal 7 fires at step 2,400: "Loss has stagnated at 0.5796 for 2,089 steps despite healthy gradient flow. The data signal-to-noise ratio may be too low." — 33,780 total steps, zero hard violations, zero interventions. The advisory was correct and sufficient.
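A stagnation signal of this kind can be sketched as a band check over recent losses, gated on gradient health. This is an assumption about the shape of the check, not the real Signal 7 implementation; all names and thresholds are illustrative.

```rust
/// Illustrative stagnation signal: loss pinned inside a narrow band
/// for many steps while gradients stay healthy points at the data,
/// not the optimizer.
fn loss_stagnated(
    recent_losses: &[f64],
    band: f64,              // max loss range to count as "flat"
    min_steps: usize,       // how long the flat run must last
    grad_norm: f64,
    healthy_grad_floor: f64,
) -> bool {
    if recent_losses.len() < min_steps || grad_norm < healthy_grad_floor {
        return false; // too little history, or gradients are the problem
    }
    let max = recent_losses.iter().cloned().fold(f64::MIN, f64::max);
    let min = recent_losses.iter().cloned().fold(f64::MAX, f64::min);
    max - min < band
}
```

The gradient-health gate is what distinguishes this advisory from gradient death: here the optimizer is working, so the blame shifts to the data's signal-to-noise ratio.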
TransXLab catches bad configurations, incompatible hyperparameters, and doomed setups before you burn a single GPU hour. TransXform supervises DURING training to enforce invariants in real time.
# Core supervisor (no optional deps)
[dependencies]
transxform = "0.1"
# With PyTorch integration + TUI + audit trail
transxform = { version = "0.1", features = ["tch", "witness", "merkle"] }
# Generate a starter spec
$ transxform init --output transxform-spec.yaml
use transxform::{Spec, Supervisor, Verdict};

let spec = Spec::from_file("transxform-spec.yaml")?;
let mut supervisor = Supervisor::new(spec);

// In your training loop:
for step in 0..max_steps {
    let grads = compute_gradients(&model, &batch);

    // Supervisor validates before applying
    let verdict = supervisor.evaluate(&model, &grads, step);
    match verdict {
        Verdict::Apply(grads) => optimizer.step(grads),
        Verdict::Intervene(fix) => fix.apply(&mut model),
        Verdict::HaltPhase(_reason) => break, // phase halted; stop training
    }
}