
Why ReLoRA Struggles with Small Language Models


Yuval Weiss

Parameter-efficient training methods have transformed how we work with large language models. LoRA, which fine-tunes only a small set of low-rank matrix updates while keeping the rest of the model frozen, has become a staple of the modern ML toolkit. Its extension, ReLoRA, pushed this idea one step further: instead of just fine-tuning with low-rank updates, why not pretrain with them? By periodically merging and reinitialising low-rank adapters throughout training, ReLoRA can accumulate a high-rank effective update over time while keeping per-step compute cheap.

It's a clever idea — and it works well for models in the hundreds of millions to billions of parameters. But what happens when you try it on small language models (SLMs) of 11M to 66M parameters? That's the question we set out to answer.

The short answer: it doesn't go well. ReLoRA consistently hurts SLM performance, and our learning dynamics analysis reveals exactly why.


The Setup: LoRA, ReLoRA, and the Rank Perspective

To understand what's going wrong, it helps to understand what ReLoRA is actually doing at a mathematical level.

LoRA works by decomposing each weight update into a product of two small matrices: $\Delta W = s\,W_B W_A$, with $W_B \in \mathbb{R}^{m \times r}$ and $W_A \in \mathbb{R}^{r \times n}$, where $r \ll \min(m, n)$ is a small "rank" parameter. Rather than updating a full $m \times n$ weight matrix, you train only two thin matrices, drastically reducing the number of trainable parameters.
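To make this concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. The class name, initialisation choices, and scaling convention are illustrative assumptions, not the exact pico-relora implementation.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update: ΔW = s * W_B @ W_A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                  # keep the full-rank weight frozen
        out_features, in_features = base.weight.shape
        self.W_A = nn.Parameter(torch.empty(r, in_features))    # r x n
        self.W_B = nn.Parameter(torch.zeros(out_features, r))   # m x r, zero-init so ΔW starts at 0
        nn.init.kaiming_uniform_(self.W_A, a=math.sqrt(5))
        self.scale = alpha / r                                   # the scaling factor s

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * ((x @ self.W_A.T) @ self.W_B.T)
```

Zero-initialising $W_B$ means training starts from exactly the base weights, and only the two thin adapter matrices receive gradients.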

ReLoRA builds on a simple algebraic fact: even if each individual update is low-rank, the sum of many low-rank updates can be high-rank. So ReLoRA periodically "commits" each LoRA module back into the base weights and reinitialises it, accumulating a richer overall update than any single low-rank step could achieve.

$$\Delta W_{\text{eff}} = s \sum_{i=1}^{N} W_B^i W_A^i$$
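A rough sketch of that merge-and-reset step, building on the hypothetical `LoRALinear` module above (the real ReLoRA recipe also prunes optimiser state and uses a jagged learning-rate schedule, which this sketch omits):

```python
import math
import torch
import torch.nn as nn

@torch.no_grad()
def merge_and_reinit(module: LoRALinear) -> None:
    """Fold the current low-rank update into the frozen base weight, then restart the adapter."""
    # Commit ΔW = s * W_B @ W_A into the base weights.
    module.base.weight += module.scale * (module.W_B @ module.W_A)
    # Reinitialise the adapter so the next cycle learns a fresh low-rank direction.
    nn.init.kaiming_uniform_(module.W_A, a=math.sqrt(5))
    nn.init.zeros_(module.W_B)

# In the training loop, every `reset_interval` steps:
#     for m in model.modules():
#         if isinstance(m, LoRALinear):
#             merge_and_reinit(m)
```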

This strategy was motivated by observations that large transformers already learn through locally low-rank trajectories that gradually expand — so structuring pretraining around that pattern seemed like a natural fit.

The question is whether that intuition still holds when your model only has 11 or 66 million parameters.


Why Small Models Are Different

Here's the key tension. Work from our group and others has shown that SLMs don't use their representational capacity very efficiently. Their weight matrices tend to be rank-deficient — the effective rank of key matrices is substantially lower than their nominal dimension — and their internal representations are anisotropic, meaning token embeddings cluster tightly rather than spreading across the full representational space.
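Anisotropy is often quantified as the average cosine similarity between randomly sampled token representations: values near 1 indicate tight clustering, values near 0 an isotropic spread. The sketch below is an illustrative measure in that spirit, not the exact metric used in the paper:

```python
import torch
import torch.nn.functional as F

def mean_cosine_similarity(embeddings: torch.Tensor, n_pairs: int = 10_000) -> float:
    """Estimate anisotropy as the mean cosine similarity over random pairs of embeddings."""
    idx = torch.randint(0, embeddings.shape[0], (n_pairs, 2))
    a = F.normalize(embeddings[idx[:, 0]], dim=-1)
    b = F.normalize(embeddings[idx[:, 1]], dim=-1)
    return (a * b).sum(dim=-1).mean().item()
```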

That tension creates two competing hypotheses:

The optimistic case (Boost): ReLoRA's rank-expanding mechanism might be exactly what SLMs need. If their bottleneck is staying trapped in low-rank subspaces, periodically injecting new low-rank directions could widen those bottlenecks and improve learning.

The pessimistic case (Drag): Repeatedly projecting gradients through a low-rank bottleneck might worsen the rank deficiency problem, since SLMs lack the redundancy that allows larger models to absorb such constraints.

To test this, we trained matched pairs of models — our Llama-style pico-decoder baseline versus pico-relora — at both the tiny (11M) and small (66M) parameter scales, each for 20,000 steps on approximately 42 billion tokens from the Dolma dataset.


What We Found: The Drag Dominates

Across every metric we measured — cross-entropy loss, Paloma perplexity (a broad 546-domain benchmark), and BLiMP (a grammatical acceptability evaluation) — ReLoRA underperformed full-rank training. The gap was small but present at 11M parameters, and widened substantially at 66M.

The BLiMP results are especially notable: the performance difference is statistically significant at the $10^{-10}$ level, with ReLoRA falling behind by a meaningful margin on linguistic understanding tasks.

One mildly interesting wrinkle: the tiny ReLoRA model avoided a loss spike that the baseline decoder hit around mid-training (likely a local minimum artefact), which hints at some stabilising effect. But this is a single-run observation, and the overall pattern is clear — ReLoRA is a drag on SLM pretraining.


Looking Under the Hood: Rank and Conditioning

The performance numbers tell us that something is wrong. Our learning dynamics analysis tells us why.

Effective Rank is Declining

We tracked proportional effective rank (PER) — a scale-normalised measure of how many "useful" dimensions a weight matrix actually uses — across the OV circuits (how attention heads write to the residual stream) and SwiGLU feed-forward layers throughout training.
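As a reference point, here is one way such a measure can be computed, assuming the standard entropy-based effective rank (the exponentiated Shannon entropy of the normalised singular value distribution) divided by the maximum possible rank; the paper's exact definition may differ in its details:

```python
import torch

def proportional_effective_rank(W: torch.Tensor, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of W, normalised by its maximum possible rank."""
    s = torch.linalg.svdvals(W.float())
    p = s / (s.sum() + eps)                     # singular values as a probability distribution
    entropy = -(p * torch.log(p + eps)).sum()   # Shannon entropy of that distribution
    erank = torch.exp(entropy)                  # effective rank (Roy & Vetterli, 2007)
    return (erank / min(W.shape)).item()        # proportional: 1.0 means "uses all dimensions"
```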

The results are striking. While pico-decoder's PER stays roughly flat or increases, pico-relora's PER steadily declines over training. The model is becoming progressively more rank-deficient as training proceeds — the opposite of what we were hoping for. This effect is consistent across layers and gets worse at larger scale.

The same pattern shows up in the gradient updates: the LoRA modules that drive learning in ReLoRA have substantially lower PER than the full-rank gradients of the baseline. ReLoRA isn't just failing to expand rank — it's actively compressing it.

Gradient Updates are Severely Ill-Conditioned

The condition number of a matrix, $\kappa = \sigma_{\max} / \sigma_{\min}$, tells you how sensitive the linear system it defines is to small perturbations. A high condition number means numerical errors get amplified: if $\kappa = 10^k$, you can lose roughly $k$ digits of precision.
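Computing it from the singular values is straightforward; a minimal sketch:

```python
import torch

def condition_number(G: torch.Tensor, eps: float = 1e-12) -> float:
    """kappa = sigma_max / sigma_min of a gradient (or weight) matrix."""
    s = torch.linalg.svdvals(G.float())
    return (s.max() / (s.min() + eps)).item()

# A kappa around 1e8 implies roughly 8 decimal digits of precision can be lost in that
# update, which already exceeds float32's ~7 significant decimal digits.
```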

For pico-relora, the condition numbers of gradient updates are catastrophically large early in training, reaching $10^7$–$10^8$ for some projections at the small scale. This means the model's updates can lose 7–8 digits of precision, far exceeding what float16 or even float32 arithmetic can reliably represent. While these values do eventually come down toward baseline levels, the early-training instability compounds with the existing rank deficiency problems.

Why These Two Effects Compound

SLMs already have anisotropic representations — their token embeddings are unevenly clustered, so small perturbations to an input can send it into a sparsely populated region of representation space. When you combine this with highly ill-conditioned gradient updates, the model becomes extremely sensitive to input fluctuations. Tokens with similar internal representations can produce wildly different outputs, and the model can't easily correct for this because its limited capacity is being further constrained by the low-rank update structure.


The Broader Implication

Our results highlight something important: parameter-efficient pretraining doesn't scale down trivially.

In billion-parameter models, there's enough redundancy that projecting gradients through a rank-rr bottleneck loses relatively little signal — the model has many other pathways to work with. In an 11M or 66M parameter model, every dimension counts. Imposing a low-rank constraint means genuinely losing information that the model doesn't have spare capacity to recover.

This also raises a more basic question: do SLMs actually need parameter-efficient pretraining? A 66M parameter model can be trained on a single modern GPU. The compute savings that motivate LoRA-style approaches in the multi-billion-parameter regime just aren't as compelling here — and as our results show, they come at a real quality cost.


What Should We Try Instead?

Our results motivate hybrid or adaptive-rank approaches. Rather than committing to low-rank updates throughout training, you might:

  • Use selective full-rank updates for the most rank-deficient layers while keeping LoRA elsewhere
  • Adopt DyLoRA-style dynamic rank adaptation that adjusts the rank budget based on observed learning dynamics
  • Use full-rank training during early stages (when ill-conditioning is worst) and switch to low-rank only later

More broadly, any parameter-efficient method for SLMs needs to explicitly preserve the rank structure of updates, or find ways to compensate for the rank deficiencies that these small models are prone to.


Code and Resources

All code is available in our public fork pico-relora, which extends pico-train with ReLoRA support. We've also released a HuggingFace Space for evaluating language models on BLiMP via the evaluate library.

The paper was presented at BlackboxNLP 2025 and is available on ACL Anthology.