“Almost all knowledge in large language models is learned during pretraining.” - LIMA (Zhou et al., 2023)
Pre-training is where models actually learn.
Fine-tuning is where we aim them.
That’s the whole story in two lines. But let’s unpack it.
First: what “pre-training” really does
Pre-training is self-supervised reading at scale. The model predicts the next token across diverse sources, learning syntax, facts, and latent procedures.
By the end, you don’t get a polite assistant. You get a general engine that can compress text patterns and reason over them. LIMA made this visible: with only ~1,000 high-quality instruction examples on top of a strong base, human raters judged its answers equivalent to or better than GPT-4’s in 43% of cases - because the base already knew the hard stuff.
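Mechanically, everything above falls out of a single loss. Below is a minimal sketch of the causal language-modeling objective in PyTorch; `model` here is a stand-in for any autoregressive network that maps token ids to vocabulary logits, not a specific library's API.

```python
# Minimal sketch of the pre-training objective: next-token prediction.
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Causal LM loss: predict token t+1 given tokens 1..t."""
    inputs = token_ids[:, :-1]       # every token except the last
    targets = token_ids[:, 1:]       # the same sequence shifted left by one
    logits = model(inputs)           # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # flatten to (N, vocab_size)
        targets.reshape(-1),                   # flatten to (N,)
    )
```

Syntax, facts, and latent procedures are simply whatever helps drive this one number down across trillions of tokens.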

Why most of the skill lives in pre-training
Two lines of evidence recur:
- Foundation evaluations: families like Llama 2 release both base and chat models. The base model’s broad competence tracks the quality of its pre-training corpus and recipe; chat tuning mostly steers style and safety.
- Minimal-tuning studies: LIMA showed that a tiny, carefully chosen SFT set can unlock capabilities that were already there.
Put simply: pre-training builds capability; fine-tuning reveals it.
Data quality beats raw size (but scale still matters)
The old mental model was “bigger web crawl = better model.” The new model is “better data = better model.”
RefinedWeb showed that properly filtered and deduplicated web data alone can rival curated mixtures - models trained on it even outperform models trained on The Pile - while extracting trillions of usable tokens from Common Crawl.
Dolma pushed in a different direction: openness and instrumentation. A 3-trillion-token corpus plus tooling lets researchers measure how mix choices shift model behavior - crucial if you care about repeatable performance, not vibes.
D4 adds another lever: what you repeat and how you select. Smart, embedding-based diversification and repetition delivered up to ~20% training-efficiency gains and small but real accuracy lifts versus random replay. That’s free performance if you already pay for compute.
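As a rough illustration of the selection idea (not D4's actual implementation), embedding-based curation can be as simple as dropping near-duplicates and then keeping cluster representatives. `doc_embeddings` is assumed to come from any document embedder; the 0.98 duplicate threshold is a placeholder.

```python
# Hedged sketch: near-dedup + diversification over document embeddings.
import numpy as np
from sklearn.cluster import KMeans

def select_diverse(doc_embeddings, keep, dup_threshold=0.98):
    # 1) Near-duplicate removal: greedily drop docs too similar to anything kept so far.
    normed = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < dup_threshold:
            kept.append(i)
    survivors = normed[kept]

    # 2) Diversification: cluster the survivors, keep the doc nearest each centroid.
    k = min(keep, len(kept))
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(survivors)
    chosen = []
    for c in range(k):
        members = np.where(labels == c)[0]
        centroid = survivors[members].mean(axis=0)
        chosen.append(kept[members[np.argmax(survivors[members] @ centroid)]])
    return chosen  # indices into the original document list
```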
Small models, big gains: when curation outmuscles parameters
Phi-2 (2.7B) is a great case study. Microsoft trained it on curated “textbook-quality” synthetic data plus carefully filtered web data. Result: on complex benchmarks, Phi-2 matches or outperforms models up to 25× larger. That’s the pre-training distribution doing the heavy lifting.
Gemma (2B/7B) tells the same story from a different lab: lightweight, open checkpoints trained with careful data recipes post competitive scores across understanding and reasoning - again underscoring that curation lets smaller models punch above their weight.
Architecture choices during pre-training matter
Pre-training isn’t just “more tokens.” Two structural levers change downstream behavior:
1) Sparse Mixture-of-Experts (MoE)
Mixtral routes each token to a small subset of specialized experts per layer, boosting effective capacity without paying the dense compute bill. The paper reports that Mixtral 8×7B outperforms Llama 2 70B and GPT-3.5 on most benchmarks, with inference-speed advantages that come from sparsity: only a fraction of the parameters are active for any given token.
Why it helps: MoE lets pre-training specialize sub-networks (math, code, multilingual) yet combine them fluidly at inference. Capacity where you need it, thrift where you don’t.
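To make the routing idea concrete, here is a hedged, minimal top-2 MoE layer in PyTorch. It shows the mechanism (route each token to a few experts, mix their outputs by router weight); it is not Mixtral's implementation, which adds load balancing and efficient batched dispatch.

```python
# Hedged sketch of sparse top-k expert routing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, dim, hidden, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, dim)
        scores = self.router(x)                    # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            gate = weights[:, slot].unsqueeze(-1)  # (tokens, 1)
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e           # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += gate[mask] * expert(x[mask])
        return out
```

Per-token compute scales with `top_k`, while total parameters scale with `n_experts` - that is the capacity-per-FLOP trade.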
2) Long-context pre-training
You don’t get reliable 32k context windows by bolting on a RoPE trick at the end.
You get them by continual pre-training with longer sequences and the right curriculum. Work extending Llama 2 shows robust long-context gains without sacrificing short-context performance when you upsample longer texts and train with longer sequences.
“Support effective context windows of up to 32,768 tokens” via continual long-sequence pre-training (Xiong et al., 2023).
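A hedged sketch of what that curriculum can look like in practice; the length threshold, upsampling factor, and step boundaries below are illustrative placeholders, not the schedule from Xiong et al.

```python
# Hedged sketch: upsample long documents and grow sequence length in stages.
import random

def sample_document(docs, long_threshold=8192, upsample_factor=4):
    """Draw a doc; long documents get upsample_factor x the probability of short ones."""
    weights = [upsample_factor if len(d["tokens"]) >= long_threshold else 1 for d in docs]
    return random.choices(docs, weights=weights, k=1)[0]

# Sequence-length curriculum: (training-step boundary, max sequence length in tokens).
CURRICULUM = [(0, 4_096), (50_000, 16_384), (80_000, 32_768)]

def seq_len_for_step(step):
    return max(length for boundary, length in CURRICULUM if step >= boundary)
```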
The emergence piece: why in-context learning shows up
Transformers don’t just memorize; they learn to learn in context.
In-context learning (ICL) research shows that, after pre-training, models can implement gradient-descent-like update rules inside the context window - explaining few-shot generalization without any weight updates. That’s an emergent consequence of the pre-training objective and data distribution.
This is one reason small, high-quality corpora can be disproportionately valuable: they rehearse the right update-like patterns.
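A toy illustration of what this means operationally: the “training set” lives entirely in the prompt, and the model has to infer the rule (here, y = 3x + 1) with frozen weights. The prompt format is arbitrary; only the idea matters.

```python
# In-context learning demo: no gradient step ever touches the weights.
def make_icl_prompt(examples, query):
    lines = [f"x = {x}, y = {y}" for x, y in examples]
    lines.append(f"x = {query}, y =")
    return "\n".join(lines)

prompt = make_icl_prompt([(1, 4), (2, 7), (5, 16)], query=3)
# A well-pre-trained model tends to complete this with "10",
# behaving as if it ran a small regression inside the context window.
```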
Where fine-tuning really fits
If pre-training builds the engine, fine-tuning is:
- Steering alignment: making the model helpful, honest, harmless. Llama-2-Chat details a robust recipe for this, on top of pre-trained checkpoints.
- Activating latent skills: small, curated SFT can unlock behaviors the base already acquired (LIMA).
- Domain adaptation: e.g., legal or medical tone and format - far easier when the base truly understands the domain’s language.
But fine-tuning won’t save a weak base.
You can’t align knowledge that isn’t there.
Designing a stronger pre-training recipe
Step 1: Start with the objective you’ll live with.
Causal LM is the default; if you need retrieval-heavy workflows, consider corpus structure and long-context from day one.
Step 2: Make data curation a first-class discipline.
- Build a layered pipeline: crawl → normalize → dedup (near/exact) → filter (toxicity, spam, templated boilerplate) → diversify with embeddings; a minimal sketch follows this list.
- Preserve metadata: source, license, timestamp, language, domain, quality score.
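Here is a minimal sketch of that pipeline, with toy stand-ins for the normalizer and quality scorer; real pipelines use trained quality and toxicity classifiers, MinHash/LSH near-dedup, and language identification.

```python
# Hedged sketch of a layered curation pass: normalize -> dedup -> filter -> keep metadata.
import hashlib
import re

def normalize(text):
    """Toy normalizer: collapse whitespace. Real pipelines also do unicode/HTML cleanup."""
    return re.sub(r"\s+", " ", text).strip()

def quality_score(text):
    """Toy stand-in for toxicity/spam/boilerplate classifiers (crude lexical-diversity proxy)."""
    return min(1.0, len(set(text.split())) / 100)

def curate(raw_docs):
    seen = set()
    for doc in raw_docs:
        text = normalize(doc["text"])
        digest = hashlib.sha1(text.encode()).hexdigest()   # exact dedup; add MinHash for near-dups
        if digest in seen or quality_score(text) < 0.3:
            continue
        seen.add(digest)
        # Preserve metadata so later ablations can slice by source, license, language, etc.
        yield {**doc, "text": text, "quality": quality_score(text)}
```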
Step 3: Balance scale and quality.
- Web-only can work - if it’s filtered and deduped properly (RefinedWeb).
- Add “textbook-quality” pockets or synthetic curricula to raise reasoning yield per token (Phi-2); a toy mixture spec follows this list.
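A mixture spec can be as plain as a weighted dictionary. Every source name and weight below is an illustrative placeholder, not a recipe from any of the cited papers.

```python
# Illustrative pre-training mixture weights (placeholders, not a published recipe).
MIXTURE = {
    "filtered_web":       0.70,   # RefinedWeb-style cleaned, deduped Common Crawl
    "code":               0.10,
    "textbook_synthetic": 0.10,   # Phi-2-style "textbook-quality" pockets
    "books_papers":       0.07,
    "multilingual":       0.03,
}
assert abs(sum(MIXTURE.values()) - 1.0) < 1e-9   # weights should sum to 1
```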
Step 4: Decide on architecture early.
- If you’ll need multilingual/code/math headroom under a tight budget, MoE offers a strong capacity-per-FLOP tradeoff (Mixtral).
- If long context is a feature, plan continual pre-training with long sequences rather than retrofitting.
Step 5: Measure transfer, not just perplexity.
Evaluate on the task families you care about (reasoning, code, multilingual) at base-model stage, before any alignment, and keep ablation hooks to your data slices.
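One lightweight way to keep that honest is a per-slice scoreboard at the base-model stage. In this sketch, `generate` is a placeholder for however you sample from your checkpoint, and the task dictionary holds whichever slices you care about; none of this is a standard harness API.

```python
# Hedged sketch: report transfer per task family, not just corpus perplexity.
def evaluate_by_slice(generate, tasks):
    """tasks maps a slice name to a list of (prompt, expected_answer) pairs."""
    report = {}
    for name, examples in tasks.items():
        hits = sum(expected.strip() in generate(prompt) for prompt, expected in examples)
        report[name] = hits / len(examples)
    return report

# e.g. evaluate_by_slice(sample_fn, {"reasoning": reasoning_set, "code": code_set})
```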
Step 6: Treat repetition as a knob.
Intelligent replay (hard examples, diverse embeddings) beats blind extra epochs. It’s efficiency you can bank.
Case mini-studies (why these matter in practice)
A small model with big accuracy
You’re deploying on a single GPU and need crisp task-following with acceptable reasoning. Phi-2-style curation gives a 2–7B model surprising headroom, with fewer ops and a friendlier memory footprint.
A long-context analyst
You want to ingest 100-page PDFs and chat across them. Continual long-sequence pre-training plus modest instruction tuning beats last-minute RoPE tweaks. Plan for it up front.
A multilingual/code assistant
You need language breadth and coding depth, but inference cost matters. Sparse MoE buys effective capacity without linear compute growth - great for bursty workloads.
Common misconceptions (and fixes)
“We’ll just fine-tune it later.”
You can steer tone later. You can’t conjure knowledge later. Invest in base quality.
“More tokens are always better.”
Not if they’re duplicates or low-value. Clean first; diversify second.
“Web data is too noisy to win.”
It used to be. With modern filtering and dedup, web-only corpora can compete with or beat curated blends at scale.
“Long context is an inference hack.”
It’s a training decision. You earn it during pre-training.
If you’re building or buying an LLM, interrogate the pre-training story first.
Ask about corpus composition, dedup and filtering, repetition policy, sequence-length curriculum, and architecture.
Then, and only then, worry about fine-tuning.
Sources
- Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., … Levy, O. (2023). LIMA: Less is more for alignment. arXiv. https://arxiv.org/abs/2305.11206.
- Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv. https://arxiv.org/abs/2307.09288.
- Gemma Team. (2024). Gemma: Open models based on Gemini research and technology. arXiv. https://arxiv.org/abs/2403.08295.
- Microsoft Research. (2023, December 12). Phi-2: The surprising power of small language models. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
- Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., et al. (2023). The RefinedWeb dataset for Falcon LLM. arXiv. https://arxiv.org/abs/2306.01116.
- Soldaini, L., Kinney, R., Bhagia, A., Schwenk, D., Atkinson, D., et al. (2024). Dolma: An open corpus of three trillion tokens for language model pretraining research. arXiv. https://arxiv.org/abs/2402.00159.
- Tirumala, K., Simig, D., Aghajanyan, A., & Morcos, A. S. (2023). D4: Improving LLM pretraining via document de-duplication and diversification. arXiv. https://arxiv.org/abs/2308.12284.
- Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., et al. (2024). Mixtral of Experts. arXiv. https://arxiv.org/abs/2401.04088.
- Xiong, W., Liu, J., Molybog, I., Zhang, H., Bhargava, P., et al. (2023). Effective long-context scaling of foundation models. arXiv. https://arxiv.org/abs/2309.16039.
- Akyürek, E., Schuurmans, D., Andreas, J., Ma, T., & Zhou, D. (2023). What learning algorithm is in-context learning? Investigations with linear models. ICLR. https://openreview.net/forum?id=0g0X4H8yN4I