“Almost all knowledge in large language models is learned during pretraining.” - LIMA (Zhou et al., 2023)
Pre-training is where models actually learn.
Fine-tuning is where we aim them.
That’s the whole story in two lines. But let’s unpack it.
First: what “pre-training” really does
Pre-training is self-supervised reading at scale. The model predicts the next token across diverse sources, learning syntax, facts, and latent procedures.
By the end, you don’t get a polite assistant. You get a general engine that can compress text patterns and reason over them. LIMA made this visible: with only ~1,000 high-quality instruction examples on top of a strong base, human raters judged its answers equivalent to or better than GPT-4’s in 43% of cases - because the base already knew the hard stuff.
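Mechanically, everything above falls out of a single loss. Below is a minimal sketch of the causal language-modeling objective in PyTorch; `model` here is a stand-in for any autoregressive network that maps token ids to vocabulary logits, not a specific library's API.

```python
# Minimal sketch of the pre-training objective: next-token prediction.
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Causal LM loss: predict token t+1 given tokens 1..t."""
    inputs = token_ids[:, :-1]       # every token except the last
    targets = token_ids[:, 1:]       # the same sequence shifted left by one
    logits = model(inputs)           # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # flatten to (N, vocab_size)
        targets.reshape(-1),                   # flatten to (N,)
    )
```

Syntax, facts, and latent procedures are simply whatever helps drive this one number down across trillions of tokens.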

Why most of the skill lives in pre-training
Two lines of evidence recur:
- Foundation evaluations: families like Llama 2 release both base and chat models. The base model’s broad competence tracks the quality of its pre-training corpus and recipe; chat tuning mostly steers style and safety.
- Minimal-tuning studies: LIMA showed that a tiny, carefully chosen SFT set can unlock capabilities that were already there.
Put simply: pre-training builds capability; fine-tuning reveals it.
Data quality beats raw size (but scale still matters)
The old mental model was “bigger web crawl = better model.” The new model is “better data = better model.”
RefinedWeb showed that properly filtered and deduplicated web data alone can rival curated mixtures - models trained on it even outperform models trained on The Pile - while extracting trillions of usable tokens from Common Crawl.
Dolma pushed in a different direction: openness and instrumentation. A 3-trillion-token corpus plus tooling lets researchers measure how mix choices shift model behavior - crucial if you care about repeatable performance, not vibes.
D4 adds another lever: what you repeat and how you select. Smart, embedding-based diversification and repetition delivered up to ~20% training-efficiency gains and small but real accuracy lifts versus random replay. That’s free performance if you already pay for compute.
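As a rough illustration of the selection idea (not D4's actual implementation), embedding-based curation can be as simple as dropping near-duplicates and then keeping cluster representatives. `doc_embeddings` is assumed to come from any document embedder; the 0.98 duplicate threshold is a placeholder.

```python
# Hedged sketch: near-dedup + diversification over document embeddings.
import numpy as np
from sklearn.cluster import KMeans

def select_diverse(doc_embeddings, keep, dup_threshold=0.98):
    # 1) Near-duplicate removal: greedily drop docs too similar to anything kept so far.
    normed = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < dup_threshold:
            kept.append(i)
    survivors = normed[kept]

    # 2) Diversification: cluster the survivors, keep the doc nearest each centroid.
    k = min(keep, len(kept))
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(survivors)
    chosen = []
    for c in range(k):
        members = np.where(labels == c)[0]
        centroid = survivors[members].mean(axis=0)
        chosen.append(kept[members[np.argmax(survivors[members] @ centroid)]])
    return chosen  # indices into the original document list
```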
Small models, big gains: when curation outmuscles parameters
Phi-2 (2.7B) is a great case study. Microsoft trained it on curated “textbook-quality” synthetic data plus carefully filtered web data. Result: on complex benchmarks, Phi-2 matches or outperforms models up to 25× larger. That’s the pre-training distribution doing the heavy lifting.
Gemma (2B/7B) tells the same story from a different lab: lightweight, open checkpoints trained with careful data recipes post competitive scores across understanding and reasoning - again underscoring that curation lets smaller models punch above their weight.
Architecture choices during pre-training matter
Pre-training isn’t just “more tokens.” Two structural levers change downstream behavior:
1) Sparse Mixture-of-Experts (MoE)
Mixtral routes each token to a small subset of specialized experts per layer, boosting effective capacity without paying the dense compute bill. The paper reports that Mixtral 8×7B outperforms Llama 2 70B and GPT-3.5 on most benchmarks, with inference-speed advantages that come from sparsity: only a fraction of the parameters are active for any given token.
Why it helps: MoE lets pre-training specialize sub-networks (math, code, multilingual) yet combine them fluidly at inference. Capacity where you need it, thrift where you don’t.
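To make the routing idea concrete, here is a hedged, minimal top-2 MoE layer in PyTorch. It shows the mechanism (route each token to a few experts, mix their outputs by router weight); it is not Mixtral's implementation, which adds load balancing and efficient batched dispatch.

```python
# Hedged sketch of sparse top-k expert routing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, dim, hidden, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, dim)
        scores = self.router(x)                    # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            gate = weights[:, slot].unsqueeze(-1)  # (tokens, 1)
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e           # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += gate[mask] * expert(x[mask])
        return out
```

Per-token compute scales with `top_k`, while total parameters scale with `n_experts` - that is the capacity-per-FLOP trade.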
2) Long-context pre-training
You don’t get reliable 32k context windows by bolting on a RoPE trick at the end.
You get them by continual pre-training with longer sequences and the right curriculum. Work extending Llama 2 shows robust long-context gains without sacrificing short-context performance when you upsample longer texts and train with longer sequences.
“Support effective context windows of up to 32,768 tokens” via continual long-sequence pre-training (Xiong et al., 2023).
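A hedged sketch of what that curriculum can look like in practice; the length threshold, upsampling factor, and step boundaries below are illustrative placeholders, not the schedule from Xiong et al.

```python
# Hedged sketch: upsample long documents and grow sequence length in stages.
import random

def sample_document(docs, long_threshold=8192, upsample_factor=4):
    """Draw a doc; long documents get upsample_factor x the probability of short ones."""
    weights = [upsample_factor if len(d["tokens"]) >= long_threshold else 1 for d in docs]
    return random.choices(docs, weights=weights, k=1)[0]

# Sequence-length curriculum: (training-step boundary, max sequence length in tokens).
CURRICULUM = [(0, 4_096), (50_000, 16_384), (80_000, 32_768)]

def seq_len_for_step(step):
    return max(length for boundary, length in CURRICULUM if step >= boundary)
```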
The emergence piece: why in-context learning shows up
Transformers don’t just memorize; they learn to learn in context.
In-context learning (ICL) research shows that, after pre-training, models can implement gradient-descent-like update rules inside the context window - explaining few-shot generalization without any weight updates. That’s an emergent consequence of the pre-training objective and data distribution.
This is one reason small, high-quality corpora can be disproportionately valuable: they rehearse the right update-like patterns.
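A toy illustration of what this means operationally: the “training set” lives entirely in the prompt, and the model has to infer the rule (here, y = 3x + 1) with frozen weights. The prompt format is arbitrary; only the idea matters.

```python
# In-context learning demo: no gradient step ever touches the weights.
def make_icl_prompt(examples, query):
    lines = [f"x = {x}, y = {y}" for x, y in examples]
    lines.append(f"x = {query}, y =")
    return "\n".join(lines)

prompt = make_icl_prompt([(1, 4), (2, 7), (5, 16)], query=3)
# A well-pre-trained model tends to complete this with "10",
# behaving as if it ran a small regression inside the context window.
```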
Where fine-tuning really fits
If pre-training builds the engine, fine-tuning is:
- Steering alignment: making the model helpful, honest, harmless. Llama-2-Chat details a robust recipe for this, on top of pre-trained checkpoints.
- Activating latent skills: small, curated SFT can unlock behaviors the base already acquired (LIMA).
- Domain adaptation: e.g., legal or medical tone and format - far easier when the base truly understands the domain’s language.
But fine-tuning won’t save a weak base.
You can’t align knowledge that isn’t there.
Designing a stronger pre-training recipe
Step 1: Start with the objective you’ll live with.
Causal LM is the default; if you need retrieval-heavy workflows, consider corpus structure and long-context from day one.
Step 2: Make data curation a first-class discipline.
- Build a layered pipeline: crawl → normalize → dedup (near/exact) → filter (toxicity, spam, templated boilerplate) → diversify with embeddings; a minimal sketch follows this list.
- Preserve metadata: source, license, timestamp, language, domain, quality score.
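Here is a minimal sketch of that pipeline, with toy stand-ins for the normalizer and quality scorer; real pipelines use trained quality and toxicity classifiers, MinHash/LSH near-dedup, and language identification.

```python
# Hedged sketch of a layered curation pass: normalize -> dedup -> filter -> keep metadata.
import hashlib
import re

def normalize(text):
    """Toy normalizer: collapse whitespace. Real pipelines also do unicode/HTML cleanup."""
    return re.sub(r"\s+", " ", text).strip()

def quality_score(text):
    """Toy stand-in for toxicity/spam/boilerplate classifiers (crude lexical-diversity proxy)."""
    return min(1.0, len(set(text.split())) / 100)

def curate(raw_docs):
    seen = set()
    for doc in raw_docs:
        text = normalize(doc["text"])
        digest = hashlib.sha1(text.encode()).hexdigest()   # exact dedup; add MinHash for near-dups
        if digest in seen or quality_score(text) < 0.3:
            continue
        seen.add(digest)
        # Preserve metadata so later ablations can slice by source, license, language, etc.
        yield {**doc, "text": text, "quality": quality_score(text)}
```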
Step 3: Balance scale and quality.
- Web-only can work - if it’s filtered and deduped properly (RefinedWeb).
- Add “textbook-quality” pockets or synthetic curricula to raise reasoning yield per token (Phi-2); a toy mixture spec follows this list.
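A mixture spec can be as plain as a weighted dictionary. Every source name and weight below is an illustrative placeholder, not a recipe from any of the cited papers.

```python
# Illustrative pre-training mixture weights (placeholders, not a published recipe).
MIXTURE = {
    "filtered_web":       0.70,   # RefinedWeb-style cleaned, deduped Common Crawl
    "code":               0.10,
    "textbook_synthetic": 0.10,   # Phi-2-style "textbook-quality" pockets
    "books_papers":       0.07,
    "multilingual":       0.03,
}
assert abs(sum(MIXTURE.values()) - 1.0) < 1e-9   # weights should sum to 1
```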
Step 4: Decide on architecture early.
- If you’ll need multilingual/code/math headroom under a tight budget, MoE offers a strong capacity-per-FLOP tradeoff (Mixtral).
- If long context is a feature, plan continual pre-training with long sequences rather than retrofitting.
Step 5: Measure transfer, not just perplexity.
Evaluate on the task families you care about (reasoning, code, multilingual) at base-model stage, before any alignment, and keep ablation hooks to your data slices.
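One lightweight way to keep that honest is a per-slice scoreboard at the base-model stage. In this sketch, `generate` is a placeholder for however you sample from your checkpoint, and the task dictionary holds whichever slices you care about; none of this is a standard harness API.

```python
# Hedged sketch: report transfer per task family, not just corpus perplexity.
def evaluate_by_slice(generate, tasks):
    """tasks maps a slice name to a list of (prompt, expected_answer) pairs."""
    report = {}
    for name, examples in tasks.items():
        hits = sum(expected.strip() in generate(prompt) for prompt, expected in examples)
        report[name] = hits / len(examples)
    return report

# e.g. evaluate_by_slice(sample_fn, {"reasoning": reasoning_set, "code": code_set})
```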
Step 6: Treat repetition as a knob.
Intelligent replay (hard examples, diverse embeddings) beats blind extra epochs. It’s efficiency you can bank.
Case mini-studies (why these matter in practice)
A small model with big accuracy
You’re deploying on a single GPU and need crisp task-following with acceptable reasoning. Phi-2-style curation gives a 2–7B model surprising headroom, with fewer ops and a friendlier memory footprint.
A long-context analyst
You want to ingest 100-page PDFs and chat across them. Continual long-sequence pre-training plus modest instruction tuning beats last-minute RoPE tweaks. Plan for it up front.
A multilingual/code assistant
You need language breadth and coding depth, but inference cost matters. Sparse MoE buys effective capacity without linear compute growth - great for bursty workloads.
Common misconceptions (and fixes)
“We’ll just fine-tune it later.”
You can steer tone later. You can’t conjure knowledge later. Invest in base quality.
“More tokens are always better.”
Not if they’re duplicates or low-value. Clean first; diversify second.
“Web data is too noisy to win.”
It used to be. With modern filtering and dedup, web-only corpora can compete with or beat curated blends at scale.
“Long context is an inference hack.”
It’s a training decision. You earn it during pre-training.
If you’re building or buying an LLM, interrogate the pre-training story first.
Ask about corpus composition, dedup and filtering, repetition policy, sequence-length curriculum, and architecture.
Then, and only then, worry about fine-tuning.
Sources
- Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., … Levy, O. (2023). LIMA: Less is more for alignment. arXiv. https://arxiv.org/abs/2305.11206.
- Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv. https://arxiv.org/abs/2307.09288.
- Gemma Team. (2024). Gemma: Open models based on Gemini research and technology. arXiv. https://arxiv.org/abs/2403.08295.
- Microsoft Research. (2023, December 12). Phi-2: The surprising power of small language models. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
- Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., et al. (2023). The RefinedWeb dataset for Falcon LLM. arXiv. https://arxiv.org/abs/2306.01116.
- Soldaini, L., Kinney, R., Bhagia, A., Schwenk, D., Atkinson, D., et al. (2024). Dolma: An open corpus of three trillion tokens for language model pretraining research. arXiv. https://arxiv.org/abs/2402.00159.
- Tirumala, K., Simig, D., Aghajanyan, A., & Morcos, A. S. (2023). D4: Improving LLM pretraining via document de-duplication and diversification. arXiv. https://arxiv.org/abs/2308.12284.
- Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., et al. (2024). Mixtral of Experts. arXiv. https://arxiv.org/abs/2401.04088.
- Xiong, W., Liu, J., Molybog, I., Zhang, H., Bhargava, P., et al. (2023). Effective long-context scaling of foundation models. arXiv. https://arxiv.org/abs/2309.16039.
- Akyürek, E., Schuurmans, D., Andreas, J., Ma, T., & Zhou, D. (2023). What learning algorithm is in-context learning? Investigations with linear models. ICLR. https://openreview.net/forum?id=0g0X4H8yN4I