Teaching Small Models to Learn Faster: Meta-Learning in Pretraining
We embedded a well-known "learn-to-learn" algorithm, MAML (Model-Agnostic Meta-Learning), directly into the pretraining loop of Llama-style decoder models ranging from 11M to 570M parameters. Here's what we found, and why it matters.
David Africa