Rust · Zero Python Dependencies · Spec-Driven

boundary-governed transformer training

Silent training failure
is no longer possible

TransXform interposes a supervisory authority between the optimizer and the model. Gradient updates are provisional state changes subject to post-commit correction. Illegal states cannot persist.

5 Hard Invariants
13 Diagnostic Signals
5 Validated Runs
468M Largest Model Supervised
You've watched a run die silently.
Everyone has.
Training is open-loop. Loss goes down, dashboards stay green, and the model learns nothing useful. There is no mechanism for self-correction.
  • Representational collapse. Every hidden state converges to the same vector. The model has capacity for nothing.
  • Dead attention heads. Entire heads produce uniform distributions. Zero gradient. From the optimizer's perspective: converged. From yours: dead.
  • Shortcut learning. The model memorizes positional patterns instead of learning actual language structure. Validation loss looks fine.
  • Loss explosion. Gradient spikes silently corrupt weeks of learned representations. You only notice when downstream evals tank.
  • No alarm fires. Because there is no alarm. Standard training infrastructure has no concept of model health.
"Degenerate solutions are not transient — they are stable minima. A collapsed emission head has zero gradient. A dead attention head has zero gradient. From the optimizer's perspective, these states are converged. From the practitioner's perspective, the model is dead." — TransXform Whitepaper
Closed-loop training supervision
TransXform treats gradient updates as provisional. Every step is validated against hard invariants and diagnostic signals. Violations trigger automatic, component-local interventions.

Hard Invariants

Five non-negotiable constraints enforced at every step: cosine collapse, gradient death, variance collapse, loss explosion, and gradient spikes. Illegal states are corrected immediately.

Advisory Diagnostics

13 diagnostic signals that detect subtle pathologies: unused capacity, shortcut learning, threshold drift, intervention futility, gradient domination, overfitting, and more.

Phase-Gated Governance

Training phases with readiness gates. The model must prove competence before advancing. No more "let it cook" while the model is silently dying.

Component-Local Interventions

Reinitialize frozen submodules. Rescale collapsed representations. Dampen gradient spikes. All while preserving healthy learned structure elsewhere.

Spec-Driven (YAML)

Every invariant, threshold, phase, and intervention rule is declared in a YAML spec. Fully auditable, version-controlled, reproducible.

Witness Console + Merkle Audit

Optional TUI showing all metrics, violations, and interventions in real time. Cryptographic audit trail via Merkle tree for full reproducibility.

18 signals. Zero blind spots.
Five hard invariants that enforce immediate correction. Thirteen advisory diagnostics that surface subtle pathologies before they become catastrophic.

Hard Invariants V1 · Enforced

  • Cosine Collapse: Representations converge
  • Gradient Death: Zero gradient flow
  • Variance Collapse: Activations flatten
  • Loss Explosion: Divergent loss signal
  • Gradient Spikes: Destructive updates
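To give a feel for how small these checks are, here is a sketch of a gradient-death predicate, assuming the supervisor tracks a history of per-component gradient norms. The function name and signature are illustrative, not the actual TransXform API; the 1e-7 threshold and 50-step window mirror the spec example later on this page.

```rust
/// Illustrative sketch (not the TransXform API): a component is
/// considered gradient-dead only if its gradient norm stays below
/// `threshold` for the `consecutive` most recent steps.
fn gradient_dead(norms: &[f64], threshold: f64, consecutive: usize) -> bool {
    // A single quiet step is not a violation; the norm must stay
    // below threshold for the whole trailing window.
    norms.len() >= consecutive
        && norms.iter().rev().take(consecutive).all(|&n| n < threshold)
}

fn main() {
    let healthy = vec![0.4, 0.35, 0.5];
    let dead = vec![1e-9; 50];
    println!("healthy: {}", gradient_dead(&healthy, 1e-7, 50)); // false
    println!("dead:    {}", gradient_dead(&dead, 1e-7, 50)); // true
}
```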

Advisory Diagnostics V2 · Signals

  • Unused Capacity: Dead parameters
  • Shortcut Learning: Positional memorization
  • Threshold Drift: Boundary erosion
  • Metric Instability: Noisy signals
  • Intervention Futility: Corrections not sticking
  • Gradient Domination: Component imbalance
  • Overfitting Detection: Train/val divergence
  • + 6 more signals (see docs)
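The shortcut-learning advisory, for example, can be sketched as a divergence test between loss and activation variance: loss improving while variance climbs suggests memorization rather than structure learning. The function and the 30%/50% thresholds below are assumptions for illustration, not TransXform's actual values.

```rust
/// Illustrative sketch of a shortcut-learning advisory. Thresholds
/// are assumptions, not TransXform's: flag when loss has improved
/// more than 30% while activation variance grew more than 50%.
fn shortcut_suspected(loss_start: f64, loss_now: f64,
                      var_start: f64, var_now: f64) -> bool {
    let loss_drop = (loss_start - loss_now) / loss_start; // fraction improved
    let var_growth = (var_now - var_start) / var_start;   // fraction grown
    loss_drop > 0.30 && var_growth > 0.50
}

fn main() {
    // Loss down 37.7% while variance is up 63%: suspicious.
    println!("{}", shortcut_suspected(1.0, 0.623, 1.0, 1.63)); // true
    // Loss down with roughly stable variance: healthy.
    println!("{}", shortcut_suspected(1.0, 0.6, 1.0, 1.1)); // false
}
```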
Define training law in YAML
Every invariant, threshold, and phase gate is declared in a spec file. No custom code. No callback spaghetti. Just a contract that the supervisor enforces.
transxform-spec.yaml
# TransXform training spec
invariants:
  cosine_collapse:
    threshold: 0.95
    window: 100
    action: reinitialize
  gradient_death:
    threshold: 1e-7
    consecutive_steps: 50
    action: rescale
  variance_collapse:
    min_variance: 0.01
    action: perturb

phases:
  - name: warmup
    max_steps: 2000
    gate:
      loss_below: 4.0
      no_violations_for: 200
  - name: core_training
    max_steps: 50000
    gate:
      loss_below: 2.5
      capacity_utilization_above: 0.7

witness:
  enabled: true
  refresh_ms: 250

merkle_audit:
  enabled: true
  output: ./audit/run_{timestamp}.merkle

Invariants as Law

Each invariant defines a threshold, a detection window, and an automatic intervention. When cosine similarity exceeds 0.95 across representations, the supervisor reinitializes the collapsed component. No human required.
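As a sketch of what that detection could reduce to, here is a mean-pairwise-cosine check over a component's representations. The helper names are hypothetical, and real supervisors would work on tensors rather than `Vec<f64>`; the 0.95 threshold matches the spec.

```rust
/// Cosine similarity between two vectors (hypothetical helper,
/// shown on plain slices for clarity).
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f64 = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb: f64 = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

/// Collapse check: mean pairwise cosine similarity above the
/// threshold means the representations have converged.
fn collapsed(reps: &[Vec<f64>], threshold: f64) -> bool {
    let mut total = 0.0;
    let mut pairs = 0usize;
    for i in 0..reps.len() {
        for j in (i + 1)..reps.len() {
            total += cosine(&reps[i], &reps[j]);
            pairs += 1;
        }
    }
    pairs > 0 && total / pairs as f64 > threshold
}

fn main() {
    let identical = vec![vec![1.0, 2.0], vec![1.0, 2.0], vec![1.0, 2.0]];
    let distinct = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    println!("identical: {}", collapsed(&identical, 0.95)); // true
    println!("distinct:  {}", collapsed(&distinct, 0.95)); // false
}
```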

Phase Gates as Proof

The model cannot advance from warmup to core training until it proves competence: loss below target AND no invariant violations for 200 consecutive steps. Hope is not a strategy.
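The gate itself is just a conjunction over supervised state. A sketch under assumed struct and field names (not TransXform's types):

```rust
/// Assumed shape of the state a readiness gate inspects;
/// field names are illustrative.
struct GateState {
    loss: f64,
    steps_since_violation: usize,
}

/// A phase gate passes only when loss is below target AND the run
/// has been violation-free for the required number of steps.
fn gate_passes(s: &GateState, loss_below: f64, no_violations_for: usize) -> bool {
    s.loss < loss_below && s.steps_since_violation >= no_violations_for
}

fn main() {
    let ready = GateState { loss: 3.8, steps_since_violation: 250 };
    let unproven = GateState { loss: 3.8, steps_since_violation: 150 };
    println!("ready:    {}", gate_passes(&ready, 4.0, 200)); // true
    println!("unproven: {}", gate_passes(&unproven, 4.0, 200)); // false
}
```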

Full Audit Trail

Every supervisor decision — every violation detected, every intervention applied, every phase gate evaluated — is recorded in a Merkle-chained log. Cryptographic proof of what happened and why.
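The tamper-evidence idea can be sketched with a simple hash chain. TransXform uses a cryptographic Merkle tree; this dependency-free toy uses std's `DefaultHasher` purely to illustrate why editing any past entry breaks the chain.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy hash-chained audit log (illustration only; the real audit
/// trail is a cryptographic Merkle tree, not DefaultHasher).
struct AuditLog {
    entries: Vec<(String, u64)>, // (decision, chained hash)
}

impl AuditLog {
    fn new() -> Self {
        AuditLog { entries: Vec::new() }
    }

    fn record(&mut self, decision: &str) {
        let prev = self.entries.last().map(|e| e.1).unwrap_or(0);
        let mut h = DefaultHasher::new();
        prev.hash(&mut h);     // chain to the previous entry
        decision.hash(&mut h); // bind this decision's content
        self.entries.push((decision.to_string(), h.finish()));
    }

    /// Recompute every hash; any edited entry breaks the chain.
    fn verify(&self) -> bool {
        let mut prev = 0u64;
        for (decision, stored) in &self.entries {
            let mut h = DefaultHasher::new();
            prev.hash(&mut h);
            decision.hash(&mut h);
            if h.finish() != *stored {
                return false;
            }
            prev = *stored;
        }
        true
    }
}

fn main() {
    let mut log = AuditLog::new();
    log.record("violation: cosine_collapse @ step 6000");
    log.record("intervention: reinitialize emission_head");
    println!("chain valid: {}", log.verify()); // true

    // Rewriting history invalidates everything downstream.
    log.entries[0].0 = "tampered".to_string();
    println!("after edit:  {}", log.verify()); // false
}
```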

"Let it cook."
Training LOOM — a dual-process recurrent transformer. Loss was decreasing the entire time. Every failure was silent. At every checkpoint, the answer was the same.

Step 4,000 — Stability loss rewarding trivial convergence "let it cook"

The stability loss term was rewarding the model for producing identical outputs regardless of input. Loss curve looked healthy. The model was learning to do nothing, confidently.


Step 6,000 — Emission collapse "let it cook"

The emission heads collapsed to a single degenerate point. Zero gradient. The optimizer saw convergence. The model had lost the ability to produce varied outputs.


Step 14,000 — Recurrence erasing input signal "let it cook"

The recurrent pathway was overwriting the input signal entirely. Each recurrence step made the representation less informative. 14,000 steps of compute, wasted.


TransXform — None of these failures would have persisted

Cosine collapse detection catches emission collapse at step 6,000. Variance monitoring catches the trivial convergence at step 4,000. Gradient death detection catches the frozen recurrence. Each violation triggers an immediate, targeted intervention. The training self-corrects.

5 real training runs. 1.6M to 468M parameters.
Not toy benchmarks. Real architectures, real data, real failures caught in production.

ICFU 1.6M params

HEALTHY

Set transformer on 30K AEONCORE reasoning traces. 76,647 steps, 817K training samples.

Discovery: Epoch-boundary representation collapse — three components simultaneously spiked to cosine 0.976–0.999 at steps 25,548 and 51,097. TransXform detected every instance without being programmed to look for them. Recovery within 100 steps.

Affect loss 0.054→0.001 • Intent mastered by step 100 • Focus 0.210→0.026

SemanticNormalizer 468M params

ZERO VIOLATIONS

24-layer transformer, text→semantic packet generation. Two runs: V1.2 failed, V1.3 succeeded.

The lesson: Run 7 (V1.2) transitioned too early, carrying cosine 0.9975 into a 0.98 threshold — immediate violation cascade, then death. Run 8's (V1.3) readiness gate held the model for 300 extra steps. Zero violations. "You didn't change the model. We changed the timing of pressure."

468M params • Grad norms 0.35–0.50 (steady) • Variance 0.41→4.15 (healthy)

CRUX 2.1M params

ABORTED (CORRECT)

Intentionally pathological: 192→32 bottleneck, 3 competing heads, 400:1 gradient ratio. Designed to be unsalvageable.

Negative capability: 29 violations, 12 interventions exhausted, clean abort at step 221. Signal 10 (InterventionFutility) predicted the abort 76 steps early. Signal 11 identified a 102x gradient domination ratio.

Knowing when to stop is as important as knowing how to fix.

FROG 525K params

SHORTCUT DETECTED

2-layer transformer with gradual label leak — slow poison ramping 0→5.0 over 15K steps. Zero hard violations. V1 sees nothing wrong.

V2 catches it: Step 7,900 — ShortcutLearning advisory fires. "Loss improved 37.7% while activation variance increased 63%." Variance eventually exploded to 57x baseline. Loss hit 0.000056. The model fully memorized the shortcut.

V2 detects silent poisoning invisible to V1 structural checks.

MIRE 525K params

STAGNATION DETECTED

Clean architecture on progressively noisy data (0%→90% label noise over 15K steps). Gradients stay healthy the entire time. Loss plateaus. Standard tooling sees nothing wrong — gradients are flowing, loss isn't exploding.

Signal 7 fires at step 2,400: "Loss has stagnated at 0.5796 for 2,089 steps despite healthy gradient flow. The data signal-to-noise ratio may be too low." — 33,780 total steps, zero hard violations, zero interventions. The advisory was correct and sufficient.
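An advisory of this kind can be sketched as a flatness test over a trailing loss window. The function, window, and tolerance below are illustrative, not Signal 7's actual implementation:

```rust
/// Illustrative stagnation check (not Signal 7's real logic):
/// loss is flagged as stagnant if the trailing `min_steps` losses
/// all sit within `tolerance` of each other.
fn stagnated(recent_losses: &[f64], min_steps: usize, tolerance: f64) -> bool {
    if recent_losses.len() < min_steps {
        return false;
    }
    let window = &recent_losses[recent_losses.len() - min_steps..];
    let max = window.iter().cloned().fold(f64::MIN, f64::max);
    let min = window.iter().cloned().fold(f64::MAX, f64::min);
    max - min < tolerance
}

fn main() {
    // Loss pinned at 0.5796 for thousands of steps: stagnant.
    let flat = vec![0.5796; 2100];
    // Loss still drifting down: not stagnant.
    let improving: Vec<f64> = (0..2100).map(|i| 1.0 - i as f64 * 1e-4).collect();
    println!("flat:      {}", stagnated(&flat, 2000, 0.01)); // true
    println!("improving: {}", stagnated(&improving, 2000, 0.01)); // false
}
```

Note that the check says nothing about gradients; pairing it with a healthy-gradient condition is what turns "loss is flat" into "the data signal-to-noise ratio may be too low."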

Two tools. Complete coverage.

TransXLab validates BEFORE training

TransXLab catches bad configurations, incompatible hyperparameters, and doomed setups before you burn a single GPU hour. TransXform supervises DURING training to enforce invariants in real time.

TransXLab (config validated) → TransXform (training supervised)
Visit TransXLab →
Running in 60 seconds
1. Add to Cargo.toml

Cargo.toml
# Core supervisor (no optional deps)
transxform = "0.1"

# With PyTorch integration + TUI + audit trail
transxform = { version = "0.1", features = ["tch", "witness", "merkle"] }
2. Write your spec

bash
# Generate a starter spec
$ transxform init --output transxform-spec.yaml
3. Integrate in your training loop

main.rs
use transxform::{Supervisor, Spec, Verdict};

let spec = Spec::from_file("transxform-spec.yaml")?;
let mut supervisor = Supervisor::new(spec);

// In your training loop:
for step in 0..max_steps {
    let grads = compute_gradients(&model, &batch);

    // Supervisor validates before applying
    let verdict = supervisor.evaluate(&model, &grads, step);
    match verdict {
        Verdict::Apply(grads) => optimizer.step(grads),
        Verdict::Intervene(fix) => fix.apply(&mut model),
        Verdict::HaltPhase(reason) => break,
    }
}