How Does Gradient Clipping Help with Stabilizing Training?
What happens when your model panics — and how clipping keeps it calm
Rare sample, huge loss, huge gradient, wrecked weights, repeat: this is the fundamental loop of instability in modern AI. As models grow deeper and datasets get larger, keeping training stable is a critical challenge. A single unstable gradient from a rare sample can derail days of compute.
In this series, we cover simple tricks that prevent training runs from exploding into NaN. We use toy examples to build intuition before applying these ideas to real systems.
The Problem: Gradient Explosions
Training a neural network means repeatedly updating weights to reduce error. The gradient ∇L contains two pieces of information:
1. Direction: which way to move the weights to fix the error.
2. Magnitude: how urgently (or how far) to move.
When the model sees a rare sample it gets confidently wrong, the loss is massive. The gradient magnitude explodes, screaming at the model to make a huge update.
This happens in three scenarios:
- Rare samples: Outliers the model hasn't learned yet
- Deep networks: Gradients multiply and accumulate through layers
- Small batches: No averaging to smooth out the spikes
A single large step can overshoot the optimal weights, undoing previous work.
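To make the "deep networks" scenario above concrete, here is a toy sketch (an illustration of the general effect, not the experiment from this post): a chain of linear layers whose per-layer gain sits slightly above 1, so each backward pass multiplies the gradient and the first layer's gradient norm grows roughly geometrically with depth.

```python
import torch

torch.manual_seed(0)

def first_layer_grad_norm(depth, gain=1.5, dim=16):
    # A pure linear chain h = W_d ... W_1 x. Backprop multiplies the upstream
    # gradient by every W, so with a per-layer gain above 1 the gradient on
    # the first layer grows roughly geometrically with depth.
    Ws = [(torch.randn(dim, dim) * gain / dim ** 0.5).requires_grad_()
          for _ in range(depth)]
    h = torch.randn(dim)
    for W in Ws:
        h = W @ h
    loss = h.pow(2).mean()
    loss.backward()
    return Ws[0].grad.norm().item()

for depth in (2, 8, 32):
    print(f"depth={depth:2d}  first-layer gradient norm ≈ {first_layer_grad_norm(depth):.2e}")
```

With a per-layer gain below 1, the same chain shows the opposite failure mode: vanishing gradients.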
The Solution: Gradient Clipping
Gradient clipping is a safety valve: if the gradient is too large, scale it down, but keep it pointing the same way.
Without Clipping
Δw = −η · ∇L
Unbounded: if the gradient norm is 100, the update is 100 times larger than a typical step.
With Clipping
Δw = −η · clip(∇L, τ), where clip(∇L, τ) = ∇L · min(1, τ / ‖∇L‖)
Capped at a norm of τ. Same direction, safer speed.
Clipping separates the signal (direction) from the noise (magnitude). It allows the model to learn "I was wrong," without panicking and wrecking the weights to fix it instantly.
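Here is a minimal PyTorch sketch of that rule; the function name and the tiny epsilon guard are my own additions, not from the post:

```python
import torch

def clip_gradient(grad: torch.Tensor, tau: float) -> torch.Tensor:
    """Scale the gradient down to norm tau if it is larger; never change its direction."""
    norm = grad.norm()
    scale = (tau / (norm + 1e-6)).clamp(max=1.0)  # epsilon guards against a zero gradient
    return grad * scale

grad = torch.tensor([30.0, 40.0])           # norm = 50: a "panic" gradient
clipped = clip_gradient(grad, tau=1.0)      # norm ≈ 1: a "nudge"
print(grad / grad.norm())                   # direction of the raw gradient
print(clipped / clipped.norm())             # identical direction, only the magnitude changed
```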
Experimental Setup
To prove this, we need a controlled environment where we can force gradient spikes.
Model Architecture
A small two-layer network: a 4 × 16 hidden layer followed by a 16 × 2 output layer (2 classes).
Data Distribution
- Class A: 990 samples
- Class B: 10 samples
A 99:1 class imbalance: Class B samples are "rare events."
| Parameter | Value | Why |
|---|---|---|
| Batch size | 1 | Each gradient spike visible, no averaging |
| Learning rate | 0.01 | Small enough for stable training dynamics |
| Clip threshold | 1.0 | Conservative bound, typical default |
| Optimizer | SGD | No momentum to confound analysis |
This 99:1 imbalance creates a trap: the model quickly learns to predict "Class A" for everything. When it hits a rare Class B sample, it's confidently wrong → massive loss → gradient spike.
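The original experiment code isn't shown in the post, but a minimal sketch of the setup described above could look like this (the Gaussian toy data and every variable name here are my own assumptions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Imbalanced toy dataset: 990 samples of Class A (label 0), 10 of Class B (label 1).
# The two classes come from slightly shifted Gaussians so they are learnable.
x_a = torch.randn(990, 4)
x_b = torch.randn(10, 4) + 2.0
X = torch.cat([x_a, x_b])
y = torch.cat([torch.zeros(990, dtype=torch.long), torch.ones(10, dtype=torch.long)])

# Small MLP matching the 4 -> 16 -> 2 architecture described above.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # plain SGD, no momentum
loss_fn = nn.CrossEntropyLoss()
CLIP_THRESHOLD = 1.0

# One epoch, batch size 1: every gradient spike is visible, nothing averages it away.
for i in torch.randperm(len(X)).tolist():
    optimizer.zero_grad()
    loss = loss_fn(model(X[i:i+1]), y[i:i+1])
    loss.backward()
    # Global norm clipping: rescale all gradients so their joint L2 norm <= 1.0.
    torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_THRESHOLD)
    optimizer.step()
```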
Observation 1: The "Panic" vs. The "Nudge"
We tracked the loss specifically when the model encounters those rare Class B samples:

Without Clipping (The Panic): The model sees a rare sample, "panics" (huge update), and drastically changes weights to memorize it. Loss drops fast, but the model becomes unstable.
With Clipping (The Nudge): The gradient is still large, but the update is capped. The model "notices" the error but only takes a small step to fix it.
Counterintuitive result: The loss on rare samples is actually higher with clipping (mean 5.97 vs 4.67). Why? Because we aren't letting the model overreact. We are forcing it to learn gradually.
Observation 2: Visualizing the Spikes
When we look at the gradient norms over time:


Without clipping: You see massive spikes (7× the average) every time a rare sample appears. These spikes "yank" the weights around, destroying the delicate patterns learned from the other 99% of data.
With clipping: The computed gradient is still high (the model is still surprised), but the applied update is flat. The weight norm evolves smoothly.
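One way to log both curves, assuming the training loop from the setup sketch above (conveniently, `clip_grad_norm_` returns the total gradient norm measured before clipping):

```python
import torch

def clipped_step(model, optimizer, loss, clip_threshold=1.0, log=None):
    """One SGD step with clipping; records the computed vs. the applied gradient norm."""
    optimizer.zero_grad()
    loss.backward()
    # clip_grad_norm_ returns the total norm *before* clipping: the "computed" gradient.
    computed = torch.nn.utils.clip_grad_norm_(model.parameters(), clip_threshold)
    # The norm after clipping is what the optimizer actually applies, capped at the threshold.
    applied = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
    if log is not None:
        log.append((computed.item(), applied.item()))
    optimizer.step()
```

Plotting `computed` against `applied` over a run should reproduce the spiky-versus-flat picture described above.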
Observation 3: Stability Quantified
We compared an extreme imbalance (99:1) against a moderate one (80:20). The stability gains in the extreme case are massive:

Clipping doesn't just smooth the curve; it transforms the training dynamics from chaotic to consistent. Here are the numbers for the extreme (99:1) case:
| Metric | Without Clipping | With Clipping | Improvement |
|---|---|---|---|
| Max weight change | 0.131 | 0.022 | 6× |
| Weight norm std | 0.64 | 0.22 | 3× |
| Effective dim variance | 0.336 | 0.023 | 14× |
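The post doesn't spell out how these metrics are computed, but the first two could be derived from a log of per-step weight snapshots along these lines (the effective-dimension metric is specific to the original experiments and is omitted here):

```python
import torch

def stability_metrics(weight_snapshots):
    """weight_snapshots: one flattened parameter vector per training step."""
    W = torch.stack(weight_snapshots)              # shape: [steps, n_params]
    step_change = (W[1:] - W[:-1]).norm(dim=1)     # how far the weights moved each step
    weight_norm = W.norm(dim=1)                    # weight-norm trajectory
    return {
        "max_weight_change": step_change.max().item(),
        "weight_norm_std": weight_norm.std().item(),
    }

# Inside the training loop, after optimizer.step():
# snapshots.append(torch.cat([p.detach().flatten().clone() for p in model.parameters()]))
```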
Observation 4: A Reality Check
Here's the most important lesson: Clipping does not solve class imbalance.

In our experiments, neither model learned to predict the rare class well. Accuracy on Class B stayed near 0%. To fix that, you need techniques like focal loss or oversampling.
Clipping provides stability. It is the infrastructure that allows those other learning techniques to work without crashing the system.
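For example, oversampling the rare class is a one-liner with PyTorch's `WeightedRandomSampler` (a sketch assuming the `X`, `y` tensors from the setup above):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Give rare-class samples proportionally higher sampling weight so each batch
# sees Class B about as often as Class A. Clipping keeps the resulting
# (still spiky) updates bounded while the model learns the rare class.
class_counts = torch.bincount(y)                  # e.g. tensor([990, 10])
sample_weights = 1.0 / class_counts[y].float()    # weight for each individual sample
sampler = WeightedRandomSampler(sample_weights, num_samples=len(y), replacement=True)
loader = DataLoader(TensorDataset(X, y), batch_size=1, sampler=sampler)
```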
Gradient Clipping in Frontier LLM Research
In today's frontier LLMs, gradient clipping isn't just a legacy trick from early deep learning—it's part of a broader stability stack engineered to keep large-scale transformer training stable.
Modern systems commonly use global norm clipping, where the L2 norm of gradients across all parameters is scaled down if it exceeds a threshold, ensuring that unusually large updates don't destabilize the run. This technique is widely recommended in practice because extremely large gradient values can otherwise lead to divergence or NaNs in deep networks, particularly in transformer architectures used for LLMs [1].
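A self-contained sketch of what global norm clipping does; the manual loop below mirrors what `torch.nn.utils.clip_grad_norm_` does in a single call, up to a small epsilon guard:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()

max_norm = 1.0
# One L2 norm over *all* parameters jointly...
total_norm = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
# ...and one shared scaling factor, so the direction of the overall update is preserved.
if total_norm > max_norm:
    for p in model.parameters():
        p.grad.mul_(max_norm / total_norm)

# Equivalent library call: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
```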
[1] SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training. arXiv:2501.06842v2
Inspired by Ziming Liu's blog. Experiments and post generated with Orchestra.