How Does Gradient Clipping Help with Stabilizing Training?
What happens when your model panics — and how clipping keeps it calm
Rare sample, huge loss, huge gradient, wrecked weights, repeat: this is the fundamental loop of instability in modern AI. As models grow deeper and datasets get larger, keeping training stable is a critical challenge. A single unstable gradient from a rare sample can derail days of compute.
In this series, we cover simple tricks that prevent training runs from exploding into NaN. We use toy examples to build intuition before applying these ideas to real systems.
The Problem: Gradient Explosions
Training a neural network means repeatedly updating weights to reduce error. The gradient ∇L contains two pieces of information:
1. Direction: which way to move the weights to fix the error.
2. Magnitude: how urgently (or how far) to move.
When the model sees a rare sample it gets confidently wrong, the loss is massive. The gradient magnitude explodes, screaming at the model to make a huge update.
This happens in three scenarios:
- Rare samples: Outliers the model hasn't learned yet
- Deep networks: Gradients multiply and accumulate through layers
- Small batches: No averaging to smooth out the spikes
A single large step can overshoot the optimal weights, undoing previous work.
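To make the "deep networks" scenario above concrete, here is a toy sketch (an illustration of the general effect, not the experiment from this post): a chain of linear layers whose per-layer gain sits slightly above 1, so each backward pass multiplies the gradient and the first layer's gradient norm grows roughly geometrically with depth.

```python
import torch

torch.manual_seed(0)

def first_layer_grad_norm(depth, gain=1.5, dim=16):
    # A pure linear chain h = W_d ... W_1 x. Backprop multiplies the upstream
    # gradient by every W, so with a per-layer gain above 1 the gradient on
    # the first layer grows roughly geometrically with depth.
    Ws = [(torch.randn(dim, dim) * gain / dim ** 0.5).requires_grad_()
          for _ in range(depth)]
    h = torch.randn(dim)
    for W in Ws:
        h = W @ h
    loss = h.pow(2).mean()
    loss.backward()
    return Ws[0].grad.norm().item()

for depth in (2, 8, 32):
    print(f"depth={depth:2d}  first-layer gradient norm ≈ {first_layer_grad_norm(depth):.2e}")
```

With a per-layer gain below 1, the same chain shows the opposite failure mode: vanishing gradients.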
The Solution: Gradient Clipping
Gradient clipping is a safety valve: if the gradient is too large, scale it down, but keep it pointing the same way.
Without Clipping
Δw = −η · ∇L
Unbounded: if the gradient norm is 100, the update is 100 times larger than a typical step.
With Clipping
Δw = −η · clip(∇L, τ), where clip(∇L, τ) = ∇L · min(1, τ / ‖∇L‖)
Capped at a norm of τ. Same direction, safer speed.
Clipping separates the signal (direction) from the noise (magnitude). It allows the model to learn "I was wrong," without panicking and wrecking the weights to fix it instantly.
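Here is a minimal PyTorch sketch of that rule; the function name and the tiny epsilon guard are my own additions, not from the post:

```python
import torch

def clip_gradient(grad: torch.Tensor, tau: float) -> torch.Tensor:
    """Scale the gradient down to norm tau if it is larger; never change its direction."""
    norm = grad.norm()
    scale = (tau / (norm + 1e-6)).clamp(max=1.0)  # epsilon guards against a zero gradient
    return grad * scale

grad = torch.tensor([30.0, 40.0])           # norm = 50: a "panic" gradient
clipped = clip_gradient(grad, tau=1.0)      # norm ≈ 1: a "nudge"
print(grad / grad.norm())                   # direction of the raw gradient
print(clipped / clipped.norm())             # identical direction, only the magnitude changed
```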
Experimental Setup
To prove this, we need a controlled environment where we can force gradient spikes.
Model Architecture
A small two-layer network: a 4 × 16 hidden layer followed by a 16 × 2 output layer (2 classes).
Data Distribution
- Class A: 990 samples
- Class B: 10 samples
A 99:1 class imbalance: Class B samples are "rare events."
| Parameter | Value | Why |
|---|---|---|
| Batch size | 1 | Each gradient spike visible, no averaging |
| Learning rate | 0.01 | Small enough for stable training dynamics |
| Clip threshold | 1.0 | Conservative bound, typical default |
| Optimizer | SGD | No momentum to confound analysis |
This 99:1 imbalance creates a trap: the model quickly learns to predict "Class A" for everything. When it hits a rare Class B sample, it's confidently wrong → massive loss → gradient spike.
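The original experiment code isn't shown in the post, but a minimal sketch of the setup described above could look like this (the Gaussian toy data and every variable name here are my own assumptions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Imbalanced toy dataset: 990 samples of Class A (label 0), 10 of Class B (label 1).
# The two classes come from slightly shifted Gaussians so they are learnable.
x_a = torch.randn(990, 4)
x_b = torch.randn(10, 4) + 2.0
X = torch.cat([x_a, x_b])
y = torch.cat([torch.zeros(990, dtype=torch.long), torch.ones(10, dtype=torch.long)])

# Small MLP matching the 4 -> 16 -> 2 architecture described above.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # plain SGD, no momentum
loss_fn = nn.CrossEntropyLoss()
CLIP_THRESHOLD = 1.0

# One epoch, batch size 1: every gradient spike is visible, nothing averages it away.
for i in torch.randperm(len(X)).tolist():
    optimizer.zero_grad()
    loss = loss_fn(model(X[i:i+1]), y[i:i+1])
    loss.backward()
    # Global norm clipping: rescale all gradients so their joint L2 norm <= 1.0.
    torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_THRESHOLD)
    optimizer.step()
```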
Observation 1: The "Panic" vs. The "Nudge"
We tracked the loss specifically when the model encounters those rare Class B samples:

Without Clipping (The Panic): The model sees a rare sample, "panics" (huge update), and drastically changes weights to memorize it. Loss drops fast, but the model becomes unstable.
With Clipping (The Nudge): The gradient is still large, but the update is capped. The model "notices" the error but only takes a small step to fix it.
Counterintuitive result: The loss on rare samples is actually higher with clipping (mean 5.97 vs 4.67). Why? Because we aren't letting the model overreact. We are forcing it to learn gradually.
Observation 2: Visualizing the Spikes
When we look at the gradient norms over time:


Without clipping: You see massive spikes (7× the average) every time a rare sample appears. These spikes "yank" the weights around, destroying the delicate patterns learned from the other 99% of data.
With clipping: The computed gradient is still high (the model is still surprised), but the applied update is flat. The weight norm evolves smoothly.
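One way to log both curves, assuming the training loop from the setup sketch above (conveniently, `clip_grad_norm_` returns the total gradient norm measured before clipping):

```python
import torch

def clipped_step(model, optimizer, loss, clip_threshold=1.0, log=None):
    """One SGD step with clipping; records the computed vs. the applied gradient norm."""
    optimizer.zero_grad()
    loss.backward()
    # clip_grad_norm_ returns the total norm *before* clipping: the "computed" gradient.
    computed = torch.nn.utils.clip_grad_norm_(model.parameters(), clip_threshold)
    # The norm after clipping is what the optimizer actually applies, capped at the threshold.
    applied = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
    if log is not None:
        log.append((computed.item(), applied.item()))
    optimizer.step()
```

Plotting `computed` against `applied` over a run should reproduce the spiky-versus-flat picture described above.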
Observation 3: Stability Quantified
We compared an extreme imbalance (99:1) against a moderate one (80:20). The stability gains in the extreme case are massive:

Clipping doesn't just smooth the curve; it transforms the training dynamics from chaotic to consistent. Here are the numbers for the extreme (99:1) case:
| Metric | Without Clipping | With Clipping | Improvement |
|---|---|---|---|
| Max weight change | 0.131 | 0.022 | 6× |
| Weight norm std | 0.64 | 0.22 | 3× |
| Effective dim variance | 0.336 | 0.023 | 14× |
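The post doesn't spell out how these metrics are computed, but the first two could be derived from a log of per-step weight snapshots along these lines (the effective-dimension metric is specific to the original experiments and is omitted here):

```python
import torch

def stability_metrics(weight_snapshots):
    """weight_snapshots: one flattened parameter vector per training step."""
    W = torch.stack(weight_snapshots)              # shape: [steps, n_params]
    step_change = (W[1:] - W[:-1]).norm(dim=1)     # how far the weights moved each step
    weight_norm = W.norm(dim=1)                    # weight-norm trajectory
    return {
        "max_weight_change": step_change.max().item(),
        "weight_norm_std": weight_norm.std().item(),
    }

# Inside the training loop, after optimizer.step():
# snapshots.append(torch.cat([p.detach().flatten().clone() for p in model.parameters()]))
```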
Observation 4: A Reality Check
Here's the most important lesson: Clipping does not solve class imbalance.

In our experiments, neither model learned to predict the rare class well. Accuracy on Class B stayed near 0%. To fix that, you need techniques like focal loss or oversampling.
Clipping provides stability. It is the infrastructure that allows those other learning techniques to work without crashing the system.
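For example, oversampling the rare class is a one-liner with PyTorch's `WeightedRandomSampler` (a sketch assuming the `X`, `y` tensors from the setup above):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Give rare-class samples proportionally higher sampling weight so each batch
# sees Class B about as often as Class A. Clipping keeps the resulting
# (still spiky) updates bounded while the model learns the rare class.
class_counts = torch.bincount(y)                  # e.g. tensor([990, 10])
sample_weights = 1.0 / class_counts[y].float()    # weight for each individual sample
sampler = WeightedRandomSampler(sample_weights, num_samples=len(y), replacement=True)
loader = DataLoader(TensorDataset(X, y), batch_size=1, sampler=sampler)
```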
Gradient Clipping in Frontier LLM Research
In today's frontier LLMs, gradient clipping isn't just a legacy trick from early deep learning—it's part of a broader stability stack engineered to keep large-scale transformer training stable.
Modern systems commonly use global norm clipping, where the L2 norm of gradients across all parameters is scaled down if it exceeds a threshold, ensuring that unusually large updates don't destabilize the run. This technique is widely recommended in practice because extremely large gradient values can otherwise lead to divergence or NaNs in deep networks, particularly in transformer architectures used for LLMs [1].
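A self-contained sketch of what global norm clipping does; the manual loop below mirrors what `torch.nn.utils.clip_grad_norm_` does in a single call, up to a small epsilon guard:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()

max_norm = 1.0
# One L2 norm over *all* parameters jointly...
total_norm = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
# ...and one shared scaling factor, so the direction of the overall update is preserved.
if total_norm > max_norm:
    for p in model.parameters():
        p.grad.mul_(max_norm / total_norm)

# Equivalent library call: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
```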
[1] SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training. arXiv:2501.06842v2
Inspired by Ziming Liu's blog. Experiments and post generated with Orchestra.