boundary-governed transformer training
TransXform interposes a supervisory authority between the optimizer and the model. Gradient updates are provisional state changes subject to post-commit correction. Illegal states cannot persist.
Five non-negotiable constraints, enforced every step: cosine collapse, gradient death, variance collapse, loss explosion, and gradient spikes. Illegal states are corrected immediately.
13 diagnostic signals that detect subtle pathologies: unused capacity, shortcut learning, threshold drift, intervention futility, gradient domination, overfitting, and more.
Training phases with readiness gates. The model must prove competence before advancing. No more "let it cook" while the model is silently dying.
Reinitialize collapsed submodules. Rescale dying gradients. Dampen gradient spikes. All while preserving healthy learned structure elsewhere.
Every invariant, threshold, phase, and intervention rule is declared in a YAML spec. Fully auditable, version-controlled, reproducible.
Optional TUI showing all metrics, violations, and interventions in real time. Cryptographic audit trail via Merkle tree for full reproducibility.
# TransXform training spec
invariants:
  cosine_collapse:
    threshold: 0.95
    window: 100
    action: reinitialize
  gradient_death:
    threshold: 1e-7
    consecutive_steps: 50
    action: rescale
  variance_collapse:
    min_variance: 0.01
    action: perturb

phases:
  - name: warmup
    max_steps: 2000
    gate:
      loss_below: 4.0
      no_violations_for: 200
  - name: core_training
    max_steps: 50000
    gate:
      loss_below: 2.5
      capacity_utilization_above: 0.7

witness:
  enabled: true
  refresh_ms: 250

merkle_audit:
  enabled: true
  output: ./audit/run_{timestamp}.merkle
Each invariant defines a threshold, a detection window, and an automatic intervention. When cosine similarity exceeds 0.95 across representations, the supervisor reinitializes the collapsed component. No human required.
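To make the detection shape concrete, here is a minimal sketch of a windowed cosine-collapse check. The names (`CosineCollapse`, `observe`) are invented for illustration and are not the actual TransXform API; only the 0.95 threshold and 100-step window come from the spec above.

```rust
/// Illustrative sketch (not the real TransXform type): flag a collapse
/// when the mean pairwise cosine similarity across component
/// representations stays above a threshold for a full window of steps.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb + 1e-12)
}

struct CosineCollapse {
    threshold: f32, // e.g. 0.95, from the spec
    window: usize,  // e.g. 100 consecutive steps
    hits: usize,    // steps in a row above threshold
}

impl CosineCollapse {
    fn new(threshold: f32, window: usize) -> Self {
        Self { threshold, window, hits: 0 }
    }

    /// Returns true when the invariant is violated and the configured
    /// intervention (here: reinitialize) should fire.
    fn observe(&mut self, reprs: &[Vec<f32>]) -> bool {
        let mut sims = Vec::new();
        for i in 0..reprs.len() {
            for j in (i + 1)..reprs.len() {
                sims.push(cosine_similarity(&reprs[i], &reprs[j]));
            }
        }
        let mean = sims.iter().sum::<f32>() / sims.len().max(1) as f32;
        if mean > self.threshold { self.hits += 1 } else { self.hits = 0 }
        self.hits >= self.window
    }
}
```

The windowed streak matters: a single high-similarity step is noise, but a full window of them is a collapse.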
The model cannot advance from warmup to core training until it proves competence: loss below target AND no invariant violations for 200 consecutive steps. Hope is not a strategy.
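The gate logic itself is small. This is a hedged sketch with invented names (`PhaseGate`, `may_advance` are not the real API); the 4.0 loss target and 200-step clean window are the warmup values from the spec.

```rust
/// Illustrative readiness gate: a phase may advance only when loss is
/// below the target AND no invariant has fired for a run of steps.
struct PhaseGate {
    loss_below: f64,
    no_violations_for: u64,
}

impl PhaseGate {
    /// `last_violation_step` is None if the phase has been clean so far.
    fn may_advance(&self, loss: f64, step: u64, last_violation_step: Option<u64>) -> bool {
        let clean_steps = match last_violation_step {
            None => step,
            Some(v) => step.saturating_sub(v),
        };
        loss < self.loss_below && clean_steps >= self.no_violations_for
    }
}
```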
Every supervisor decision — every violation detected, every intervention applied, every phase gate evaluated — is recorded in a Merkle-chained log. Cryptographic proof of what happened and why.
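The tamper-evidence idea can be shown with a toy hash chain. This is an illustration only: the real audit trail is a Merkle tree, and a real implementation would use a cryptographic hash such as SHA-256, not `DefaultHasher`. All names here are assumptions, not the TransXform API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy hash-chained audit log: each entry's digest commits to the
/// previous digest, so editing any past entry changes every later one.
struct AuditLog {
    head: u64,                  // digest of the latest entry
    entries: Vec<(u64, String)>,
}

impl AuditLog {
    fn new() -> Self {
        Self { head: 0, entries: Vec::new() }
    }

    fn record(&mut self, event: &str) -> u64 {
        let mut h = DefaultHasher::new();
        self.head.hash(&mut h);
        event.hash(&mut h);
        self.head = h.finish();
        self.entries.push((self.head, event.to_string()));
        self.head
    }

    /// Recompute the chain from scratch and compare against stored digests.
    fn verify(&self) -> bool {
        let mut prev = 0u64;
        for (digest, event) in &self.entries {
            let mut h = DefaultHasher::new();
            prev.hash(&mut h);
            event.hash(&mut h);
            prev = h.finish();
            if prev != *digest {
                return false;
            }
        }
        true
    }
}
```

Any retroactive edit to a logged violation or intervention fails verification, which is what makes the trail a proof rather than a diary.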
The stability loss term was rewarding the model for producing identical outputs regardless of input. Loss curve looked healthy. The model was learning to do nothing, confidently.
The emission heads collapsed to a single degenerate point. Zero gradient. The optimizer saw convergence. The model had lost the ability to produce varied outputs.
The recurrent pathway was overwriting the input signal entirely. Each recurrence step made the representation less informative. 14,000 steps of compute, wasted.
Cosine collapse detection catches emission collapse at step 6,000. Variance monitoring catches the trivial convergence at step 4,000. Gradient death detection catches the frozen recurrence. Each violation triggers an immediate, targeted intervention. The training self-corrects.
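The gradient-death check that catches the frozen recurrence has the same streak structure. A minimal sketch (invented names; the 1e-7 floor and 50-step run are the spec's values):

```rust
/// Illustrative gradient-death check: fire when the global gradient
/// norm stays below a floor for a run of consecutive steps.
struct GradientDeath {
    threshold: f64,
    consecutive_steps: u32,
    streak: u32,
}

impl GradientDeath {
    fn new(threshold: f64, consecutive_steps: u32) -> Self {
        Self { threshold, consecutive_steps, streak: 0 }
    }

    fn observe(&mut self, grads: &[f64]) -> bool {
        let norm = grads.iter().map(|g| g * g).sum::<f64>().sqrt();
        if norm < self.threshold { self.streak += 1 } else { self.streak = 0 }
        self.streak >= self.consecutive_steps
    }
}
```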
Set transformer on 30K AEONCORE reasoning traces. 76,647 steps, 817K training samples.
Discovery: Epoch-boundary representation collapse — three components simultaneously spiked to cosine similarity 0.976–0.999 at steps 25,548 and 51,097. TransXform detected every instance without being programmed to look for this failure mode. Recovery within 100 steps.
24-layer transformer, text→semantic packet generation. Two runs: V1.2 failed, V1.3 succeeded.
The lesson: Run 7 (V1.2) transitioned too early, carrying cosine similarity 0.9975 into a 0.98 threshold: an immediate violation cascade, then death. Run 8 (V1.3)'s readiness gate held the model 300 extra steps. Zero violations. "You didn't change the model. We changed the timing of pressure."
Intentionally pathological: 192→32 bottleneck, 3 competing heads, 400:1 gradient ratio. Designed to be unsalvageable.
Negative capability: 29 violations, 12 interventions exhausted, clean abort at step 221. Signal 10 (InterventionFutility) predicted the abort 76 steps early. Signal 11 identified a 102x gradient-domination ratio.
2-layer transformer with gradual label leak — slow poison ramping 0→5.0 over 15K steps. Zero hard violations. V1 sees nothing wrong.
V2 catches it: Step 7,900 — ShortcutLearning advisory fires. "Loss improved 37.7% while activation variance increased 63%." Variance eventually exploded to 57x baseline. Loss hit 0.000056. The model fully memorized the shortcut.
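The shape of that advisory is a correlation check, sketched below with illustrative names and thresholds (only the 37.7% / 63% figures come from the run above; everything else is an assumption, not the real Signal implementation):

```rust
/// Illustrative shortcut-learning heuristic: loss improving sharply
/// while activation variance blows up suggests the model found a
/// degenerate shortcut rather than a real solution.
fn shortcut_advisory(
    loss_start: f64, loss_now: f64,
    var_start: f64, var_now: f64,
    min_loss_drop: f64,  // e.g. 0.30 = 30% improvement over the window
    max_var_growth: f64, // e.g. 0.50 = 50% variance growth
) -> bool {
    let loss_drop = (loss_start - loss_now) / loss_start;
    let var_growth = (var_now - var_start) / var_start;
    loss_drop > min_loss_drop && var_growth > max_var_growth
}
```

Neither quantity alone is alarming; it is the pairing of rapid improvement with variance blow-up that fires the advisory.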
Clean architecture on progressively noisy data (0%→90% label noise over 15K steps). Gradients stay healthy the entire time. Loss plateaus. Standard tooling sees nothing wrong — gradients are flowing, loss isn't exploding.
Signal 7 fires at step 2,400: "Loss has stagnated at 0.5796 for 2,089 steps despite healthy gradient flow. The data signal-to-noise ratio may be too low." — 33,780 total steps, zero hard violations, zero interventions. The advisory was correct and sufficient.
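A stagnation signal of this kind can be sketched as a band check over recent losses, gated on gradient health. This is an assumption about the shape of the check, not the real Signal 7 implementation; all names and thresholds are illustrative.

```rust
/// Illustrative stagnation signal: loss pinned inside a narrow band
/// for many steps while gradients stay healthy points at the data,
/// not the optimizer.
fn loss_stagnated(
    recent_losses: &[f64],
    band: f64,              // max loss range to count as "flat"
    min_steps: usize,       // how long the flat run must last
    grad_norm: f64,
    healthy_grad_floor: f64,
) -> bool {
    if recent_losses.len() < min_steps || grad_norm < healthy_grad_floor {
        return false; // too little history, or gradients are the problem
    }
    let max = recent_losses.iter().cloned().fold(f64::MIN, f64::max);
    let min = recent_losses.iter().cloned().fold(f64::MAX, f64::min);
    max - min < band
}
```

The gradient-health gate is what distinguishes this advisory from gradient death: here the optimizer is working, so the blame shifts to the data's signal-to-noise ratio.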
TransXLab catches bad configurations, incompatible hyperparameters, and doomed setups before you burn a single GPU hour. TransXform supervises DURING training to enforce invariants in real time.
# Core supervisor (no optional deps)
[dependencies]
transxform = "0.1"
# With PyTorch integration + TUI + audit trail
transxform = { version = "0.1", features = ["tch", "witness", "merkle"] }
# Generate a starter spec
$ transxform init --output transxform-spec.yaml
use transxform::{Spec, Supervisor, Verdict};

let spec = Spec::from_file("transxform-spec.yaml")?;
let mut supervisor = Supervisor::new(spec);

// In your training loop:
for step in 0..max_steps {
    let grads = compute_gradients(&model, &batch);

    // Supervisor validates before applying
    let verdict = supervisor.evaluate(&model, &grads, step);
    match verdict {
        Verdict::Apply(grads) => optimizer.step(grads),
        Verdict::Intervene(fix) => fix.apply(&mut model),
        Verdict::HaltPhase(_reason) => break, // phase halted; stop training
    }
}