Teaching Small Models to Learn Faster: Meta-Learning in Pretraining

We embedded a well-known "learn-to-learn" algorithm (MAML) directly into the pretraining loop of Llama-style decoder models, ranging from 11M to 570M parameters. Here's what we found, and why it matters.

David Africa

Large language models keep getting bigger, and with them, the compute bills. At Pico, we're interested in a different question: can we make small language models smarter by changing how they learn, rather than just throwing more data or parameters at the problem?

Our new paper, Learning Dynamics of Meta-Learning in Small Model Pretraining, presented at IJCNLP-AACL 2025, takes a step in that direction.


The Core Idea: Learning-to-Learn During Pretraining

Most language models are trained with a single objective: predict the next token. It's simple and effective, but it doesn't give the model any explicit practice at adapting quickly to new tasks. Meta-learning — and specifically Model-Agnostic Meta-Learning (MAML) — is designed to do exactly that. The goal is to find model weights that serve as a great starting point, such that a small number of gradient steps on any new task leads to strong performance.
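The bilevel structure behind this goal can be sketched on a toy problem. The snippet below is a minimal first-order MAML loop on scalar regression tasks y ≈ a·x; it is purely illustrative and not the paper's training code, and every function name and hyperparameter here is a made-up placeholder.

```python
import numpy as np

def inner_adapt(w, x, y, inner_lr=0.1, steps=1):
    """Inner loop: a few gradient steps on one task's support set."""
    for _ in range(steps):
        grad = 2 * np.mean((w * x - y) * x)  # d/dw of MSE for y ≈ w·x
        w = w - inner_lr * grad
    return w

def meta_train(tasks, meta_lr=0.05, epochs=200, inner_lr=0.1):
    """Outer loop: move the shared initialization so that a single
    inner step already does well on each task's query set."""
    w = 0.0
    for _ in range(epochs):
        meta_grad = 0.0
        for xs, ys, xq, yq in tasks:
            w_fast = inner_adapt(w, xs, ys, inner_lr)
            # First-order approximation: take the query-set gradient at
            # the adapted weights rather than differentiating through
            # the inner update itself.
            meta_grad += 2 * np.mean((w_fast * xq - yq) * xq)
        w -= meta_lr * meta_grad / len(tasks)
    return w
```

With two tasks of slopes 1 and 3, the meta-learned initialization settles near w ≈ 2, the point from which one inner step can reach either task.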

MAML has worked well in computer vision and reinforcement learning. Its use in NLP has mostly been limited to fine-tuning large pretrained encoders like BERT. We wanted to know: what happens if you bake meta-learning directly into the pretraining of a small decoder model, right from scratch?

To construct the meta-learning episodes, we used a technique called Subset-Masked Language Modelling Tasks (SMLMT). The idea is elegant: pick a small set of words, mask each one in context, and ask the model to identify which word belongs in which sentence — a few-shot classification task built entirely from unlabelled text, with no human annotation required. We interleaved these episodic tasks with standard next-token prediction throughout pretraining.
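As a rough illustration of the episode construction (a simplified sketch with made-up sampling details, not the paper's implementation), an SMLMT episode can be assembled from raw sentences like this:

```python
import random

def build_smlmt_episode(sentences, k_ways=2, n_shots=2, mask_token="<mask>"):
    """Build one k-way few-shot episode from unlabelled text: pick k
    words, mask them in the sentences that contain them, and label each
    masked sentence by which word was removed."""
    # Index sentences by the words they contain.
    by_word = {}
    for s in sentences:
        for w in set(s.split()):
            by_word.setdefault(w, []).append(s)
    # Keep only words with enough example sentences for support + query.
    candidates = [w for w, ss in by_word.items() if len(ss) >= n_shots + 1]
    words = random.sample(candidates, k_ways)

    support, query = [], []
    for label, w in enumerate(words):
        picks = random.sample(by_word[w], n_shots + 1)
        masked = [s.replace(w, mask_token) for s in picks]
        support += [(m, label) for m in masked[:n_shots]]
        query.append((masked[-1], label))
    return support, query
```

Each episode is a fresh classification task over a fresh word subset, which is what gives the model repeated practice at rapid adaptation.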

Crucially, only a tiny classification head is updated in each inner-loop step, leaving the backbone weights untouched during adaptation. This makes the training dynamics interpretable: we can track exactly how the backbone representations evolve without gradient noise from the inner loop contaminating our view.
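In spirit, the head-only inner loop looks like the sketch below. This is a simplified NumPy illustration, not Pico's code: the frozen 2-D features, the softmax head, and all parameter values are invented for the example.

```python
import numpy as np

def adapt_head_only(features, labels, n_classes, inner_lr=0.5, steps=5):
    """Fit only a linear softmax head on frozen backbone features.

    `features` is (n, d) and is treated as a constant, so no gradient
    ever touches the backbone weights."""
    n, d = features.shape
    W = np.zeros((d, n_classes))            # fresh head per episode
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = features @ W
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        grad = features.T @ (probs - onehot) / n  # cross-entropy gradient
        W -= inner_lr * grad
    return W
```

Because only `W` changes, anything that improves on such episodes must come from the backbone producing more adaptable features.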


What We Found

1. Faster convergence, with a trade-off

The most immediate result is that MAML-pretrained models reach the same training loss up to 1.6× faster than vanilla models. That's a meaningful speedup under fixed compute.

The catch: perplexity on Paloma — a diverse language modelling benchmark spanning 18 domains — is worse for most model sizes at the 6,000-step mark. The episodic objective seems to sharpen the model's adaptation machinery at the expense of broad distributional fluency. Whether this gap would close with longer training is an open question we're actively investigating.

2. Modest but consistent downstream gains on NER

We evaluated all models on Universal NER, a multilingual named entity recognition benchmark spanning languages from Danish to Cebuano. Under full fine-tuning, MAML-pretrained models score 2–3 percentage points higher on average than their vanilla counterparts at medium and large scales. Small and tiny models are more erratic: the gains depend heavily on whether the model has enough capacity to encode a reusable inductive bias across episodes.

The most pronounced gains show up in head-only fine-tuning (where the backbone is frozen), which directly mirrors the inner-loop setup from pretraining. This is evidence that the episodic training isn't just improving optimization — it's shaping the geometry of the representations in a way that makes them more transferable.

Interestingly, MAML provides surprisingly strong zero-shot transfer gains on low-resource languages like Tagalog and Cebuano, especially for small and medium models in the head-only regime. The episodic inductive bias appears to promote language-agnostic representations that generalize beyond English, even when the entire pretraining corpus was English.

3. A phase transition in the large model

Perhaps the most striking finding is a representational phase transition observed in the 570M model, visible in the figure below.

Around training step 3,000, three things happen in rapid succession:

  • Proportional Effective Rank (PER) — a measure of how many dimensions the model's representations actively use — drops sharply.
  • Paloma perplexity also drops sharply after an initial plateau.
  • Query accuracy (how well the model generalizes from support to novel query examples in each episode) jumps abruptly from near-random to above 50%.
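PER can be computed from a matrix of hidden states. The helper below is our own sketch of one common definition (effective rank as the exponential of the singular-value entropy, normalized by the maximum possible rank); it may differ in detail from the exact formula used in the paper.

```python
import numpy as np

def proportional_effective_rank(reps):
    """Effective rank of a representation matrix, as a fraction of its
    maximum possible rank. Near 1: dimensions used evenly; near 0:
    activity collapsed onto a few directions."""
    s = np.linalg.svd(reps, compute_uv=False)
    p = s / s.sum()                          # singular values as a distribution
    entropy = -np.sum(p * np.log(p + 1e-12))
    return np.exp(entropy) / min(reps.shape)
```

An identity-like matrix scores close to 1, while a rank-one matrix scores close to 1/min(n, d), so a sharp PER drop signals exactly the kind of representational collapse described above.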

We interpret this as a diversify-then-compress pattern: the model first explores a broad, high-dimensional representational space, then collapses into a more structured, lower-dimensional regime that is better suited to few-shot generalization. This mirrors theoretical accounts of phase transitions in neural network training — the model transitions from brute-force fitting to something more like compressed, algorithmic abstraction.

No comparable transition is visible in vanilla-trained models of the same size. This suggests that the bilevel structure of MAML's objective is doing real work in reorganizing the loss landscape.


What It Means for Pico

This work was built entirely on the Pico framework — a lightweight, open-source suite for studying language model learning dynamics. We released all four MAML-pretrained checkpoints (11M, 65M, 181M, 570M), the training code, and full checkpoint logs including singular-value spectra, head entropies, and episodic accuracies.

One of Pico's core goals is to make the internals of training legible. This paper is a direct demonstration of that philosophy: by tracking spectral structure and episodic performance throughout pretraining, we could observe a phase transition that would be completely invisible from training loss alone. We think spectral diagnostics — like the PER collapse we observed — have real potential as self-supervised early-stopping or curriculum signals in future work.
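As a toy illustration of what such a signal could look like (entirely hypothetical: the window size and threshold below are made-up parameters, not anything we have validated), a collapse detector might be:

```python
def detect_per_collapse(per_history, window=5, drop_ratio=0.8):
    """Return the first step at which PER falls below `drop_ratio` times
    the mean of the previous `window` measurements, else None."""
    for t in range(window, len(per_history)):
        baseline = sum(per_history[t - window:t]) / window
        if per_history[t] < drop_ratio * baseline:
            return t
    return None
```

A trainer could watch this signal to trigger a curriculum switch or a checkpoint for closer inspection, instead of relying on training loss alone.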


Caveats and What's Next

We want to be upfront about the limitations. Every result here is from a single random seed per condition (GPU budget is real). We evaluated on NER only, so generalisation to reasoning or generation-quality tasks is unknown. The pretraining corpus is entirely English, which likely caps the cross-lingual benefits we observed. And training stopped at 6,000 steps — potentially before the largest models had fully converged.

With those caveats in mind, the agenda for follow-up work is clear:

  • Multi-seed runs and hyperparameter sweeps to put error bars on the gains and better understand the inner-loop design space (episode frequency, inner learning rate, number of shots).
  • Multilingual pretraining to test whether the phase transition re-emerges and whether cross-script transfer improves.
  • Varied adaptation targets — which layers should the inner loop update? Is it better to adapt attention, FFN, or both?
  • Non-NER evaluation on classification, question answering, and reasoning tasks to understand how broadly the episodic inductive bias generalises.
  • Spectral early stopping — using the PER collapse as an automatic signal for curriculum transitions.