(Updated August 2025)

“Almost all knowledge in large language models is learned during pretraining.” (arXiv)

Pre-training is where models actually learn.
Fine-tuning is where we aim them.

That’s the whole story in two lines. But let’s unpack it.

First: what “pre-training” really does

Pre-training is self-supervised reading at scale. The model predicts the next token across diverse sources, learning syntax, facts, and latent procedures.
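
In code, the objective is nothing more exotic than next-token prediction under a causal mask. Here is a minimal PyTorch sketch; the tiny model, vocabulary size, and dimensions are illustrative stand-ins, not any production architecture:

```python
# Minimal sketch of the causal language-modeling objective (next-token prediction).
# `TinyDecoder` is a toy stand-in for a decoder-only model, not a real library class.
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Toy decoder: embedding -> one Transformer layer -> vocabulary logits."""
    def __init__(self, vocab_size=32000, d_model=256, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.block = nn.TransformerEncoder(layer, num_layers=1)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)
        # Causal mask: each position may only attend to earlier positions.
        seq_len = tokens.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        x = self.block(x, mask=mask)
        return self.lm_head(x)

model = TinyDecoder()
tokens = torch.randint(0, 32000, (2, 128))       # a batch of token ids
logits = model(tokens[:, :-1])                   # predict token t+1 from tokens <= t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)),         # (batch * seq, vocab)
    tokens[:, 1:].reshape(-1),                   # targets = inputs shifted by one
)
loss.backward()                                  # pre-training is this step, repeated at web scale
```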

By the end, you don’t get a polite assistant. You get a general engine that can compress text patterns and reason over them. LIMA made this visible: with only ~1,000 high-quality instruction examples on top of a strong base, human raters judged its answers equivalent to or better than those of leading systems in a substantial share of comparisons - because the base already knew the hard stuff.


Why most of the skill lives in pre-training

Two lines of evidence recur:

  • Foundation evaluations: families like Llama 2 release both base and chat models. The base model’s broad competence tracks the quality of its pre-training corpus and recipe; chat tuning mostly steers style and safety.
  • Minimal-tuning studies: LIMA showed tiny, carefully chosen SFT can unlock capabilities that were already there.

Put simply: pre-training builds capability; fine-tuning reveals it.

Data quality beats raw size (but scale still matters)

The old mental model was “bigger web crawl = better model.” The new model is “better data = better model.”

RefinedWeb showed that properly filtered and deduplicated web data alone can rival curated mixtures: models trained on it matched or outperformed models trained on The Pile - while the pipeline extracted trillions of usable tokens from Common Crawl.

Dolma pushed in a different direction: openness and instrumentation. A 3-trillion-token corpus plus tooling lets researchers measure how mix choices shift model behavior - crucial if you care about repeatable performance, not vibes.

D4 adds another lever: what you repeat and how you select. Smart, embedding-based diversification and repetition delivered up to ~20% training-efficiency gains and small but real accuracy lifts versus random replay. That’s free performance if you already pay for compute.
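
A hedged sketch of the general idea - not D4’s exact algorithm - using k-means over document embeddings to prune the most redundant members of each cluster so that replay stays diverse. `embed(documents)` is an assumed helper that returns one vector per document:

```python
# Hedged sketch: embedding-based diversification in the spirit of D4 (not the paper's algorithm).
# Idea: cluster document embeddings, then drop the most redundant members of each cluster.
import numpy as np
from sklearn.cluster import KMeans

def diversify(doc_embeddings: np.ndarray, keep_fraction: float = 0.5, n_clusters: int = 100):
    """Return indices of a spread-out subset (assumes more documents than clusters)."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(doc_embeddings)
    # Distance of each document to its own cluster centroid.
    dists = np.linalg.norm(doc_embeddings - km.cluster_centers_[km.labels_], axis=1)
    keep = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        # Rank members by distance from the centroid; keep the farther, more distinctive ones
        # and drop the near-duplicate/templated core.
        order = idx[np.argsort(dists[idx])[::-1]]
        keep.extend(order[: max(1, int(len(idx) * keep_fraction))])
    return np.array(sorted(keep))

# selected = diversify(embed(documents))  # indices of documents to keep or replay
```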

💡 Practical takeaways:
  • Dedup aggressively; repetition should be intentional, not accidental. (arXiv)
  • Track provenance at the slice level; you’ll need it for audits and ablations.
  • Keep “textbook-quality” pockets in the mix - reasoning benefits persist even at small sizes. (Microsoft)

Small models, big gains: when curation outmuscles parameters

Phi-2 (2.7B) is a great case study. Microsoft trained it on curated, “textbook-quality” and filtered synthetic/web data. Result: on complex benchmarks, Phi-2 matches or outperforms models up to 25× larger. That’s pre-training distribution doing heavy lifting.

Gemma (2B/7B) tells the same story from a different lab: lightweight, open checkpoints trained with careful data recipes post competitive scores across understanding and reasoning - again underscoring that curation lets smaller models punch above their weight.

Architecture choices during pre-training matter

Pre-training isn’t just “more tokens.” Two structural levers change downstream behavior:

1) Sparse Mixture-of-Experts (MoE)

Mixtral routes tokens to a small subset of specialized experts per layer, boosting effective capacity without paying the dense compute bill. The paper reports that Mixtral 8×7B outperforms Llama 2 70B and GPT-3.5 on most benchmarks, with speed advantages stemming from sparsity.

Why it helps: MoE lets pre-training specialize sub-networks (math, code, multilingual) yet combine them fluidly at inference. Capacity where you need it, thrift where you don’t.
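
A minimal sketch of the core mechanism, top-k routing (Mixtral uses top-2 of 8 experts per layer). The sizes here are illustrative, and this is not Mixtral’s actual implementation:

```python
# Minimal sketch of top-k sparse MoE routing. Illustrative sizes; not Mixtral's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)    # gating network
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        gate_logits = self.router(x)
        weights, chosen = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e            # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

moe = SparseMoE()
tokens = torch.randn(16, 512)
print(moe(tokens).shape)   # torch.Size([16, 512]) - only 2 of 8 experts ran per token
```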

2) Long-context pre-training

You don’t get reliable 32k context windows by bolting a RoPE-scaling trick on at the end.
You get them by continual pre-training with longer sequences and the right curriculum. Work extending Llama 2 shows robust long-context gains without sacrificing short-context performance when you upsample longer texts and train with longer sequences.

“Support effective context windows of up to 32,768 tokens” via continual long-sequence pre-training. (arXiv)
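
A hedged sketch of the two data-side levers implied here: upsampling long documents and growing sequence length over the run. The threshold, boost factor, and step counts are illustrative assumptions, not values from the paper:

```python
# Hedged sketch: length-aware sampling and a sequence-length curriculum for
# continual long-context pre-training. All numbers are illustrative.
import random

def length_weighted_sampler(docs, long_threshold=8192, long_boost=4.0):
    """Yield documents with probability proportional to a length-aware weight,
    so long documents appear more often than their natural frequency."""
    weights = [long_boost if len(d["tokens"]) >= long_threshold else 1.0 for d in docs]
    while True:
        yield random.choices(docs, weights=weights, k=1)[0]

def curriculum(step, warmup_steps=10_000, short_len=4096, long_len=32_768):
    """Grow the training sequence length linearly over the continual pre-training run."""
    frac = min(1.0, step / warmup_steps)
    return int(short_len + frac * (long_len - short_len))
```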

The emergence piece: why in-context learning shows up

Transformers don’t just memorize; they learn to learn in context.
ICL work shows models can implement gradient-descent-like rules inside the context window after pre-training - explaining few-shot generalization without weight updates. That’s an emergent consequence of the pre-training objective and data distribution.

This is one reason small, high-quality corpora can be disproportionately valuable: they rehearse the right update-like patterns.
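
A concrete, deliberately tiny illustration: the classic few-shot translation prompt. `base_model.generate` is a placeholder for whatever inference API you use; the point is that the “learning” happens entirely inside the context window:

```python
# Few-shot in-context learning: the "training examples" live in the prompt.
# `base_model.generate` is a placeholder, not a specific library call.
prompt = """Translate English to French.
sea otter -> loutre de mer
cheese -> fromage
peppermint -> menthe poivrée
plush giraffe ->"""
# completion = base_model.generate(prompt)
# A well pre-trained base model completes the pattern without any gradient step.
```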

Where fine-tuning really fits

If pre-training builds the engine, fine-tuning is:

  • Steering alignment: making the model helpful, honest, harmless. Llama-2-Chat details a robust recipe for this, on top of pre-trained checkpoints.
  • Activating latent skills: small, curated SFT can unlock behaviors the base already acquired (LIMA) - see the sketch after this list.
  • Domain adaptation: e.g., legal or medical tone and format - far easier when the base truly understands the domain’s language.
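
For the “activating latent skills” point, here is a minimal sketch of one SFT step, assuming a pre-trained `base_model` that returns logits and batches that carry a `response_mask` (both placeholders). It is the same next-token loss as pre-training, just restricted to curated responses:

```python
# Minimal sketch of supervised fine-tuning (SFT) on (instruction, response) pairs.
# `base_model` and the batch fields are placeholders for a real checkpoint and data loader.
import torch
import torch.nn.functional as F

def sft_step(base_model, optimizer, batch):
    """One SFT update: next-token loss computed only on response tokens."""
    logits = base_model(batch["input_ids"][:, :-1])          # (B, T-1, vocab)
    targets = batch["input_ids"][:, 1:]
    loss_mask = batch["response_mask"][:, 1:].float()        # 1.0 where the token is part of the response
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    loss = (loss * loss_mask.reshape(-1)).sum() / loss_mask.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```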

But fine-tuning won’t save a weak base.
You can’t align knowledge that isn’t there.


Designing a stronger pre-training recipe

Step 1: Start with the objective you’ll live with.
Causal LM is the default; if you need retrieval-heavy workflows, consider corpus structure and long-context from day one.

Step 2: Make data curation a first-class discipline.

  • Build a layered pipeline: crawl → normalize → dedup (near/exact) → filter (toxicity, spam, templated boilerplate) → diversify with embeddings - a sketch of the early stages follows this list.
  • Preserve metadata: source, license, timestamp, language, domain, quality score.
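
A hedged sketch of the early, cheap stages of such a pipeline (normalize → exact dedup → simple quality filters). The thresholds are illustrative; near-dedup (e.g., MinHash) and embedding-based diversification would be heavier, separate stages:

```python
# Hedged sketch of a layered curation pass: normalize -> exact dedup -> cheap quality filters.
# Thresholds are illustrative assumptions, not a recommended configuration.
import hashlib
import unicodedata

def normalize(text: str) -> str:
    return unicodedata.normalize("NFKC", text).strip()

def curate(docs):
    seen = set()
    for doc in docs:
        text = normalize(doc["text"])
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:                        # exact dedup
            continue
        seen.add(digest)
        words = text.split()
        if len(words) < 50:                       # too short to be useful
            continue
        if len(set(words)) / len(words) < 0.3:    # heavily repeated / templated boilerplate
            continue
        # Preserve provenance metadata for audits and ablations.
        yield {"text": text, "source": doc.get("source"), "license": doc.get("license")}
```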

Step 3: Balance scale and quality.

  • Web-only can work - if it’s filtered and deduped properly (RefinedWeb).
  • Add “textbook-quality” pockets or synthetic curricula to raise reasoning yield per token (Phi-2).

Step 4: Decide on architecture early.

  • If you’ll need multilingual/code/math headroom under a tight budget, MoE offers a strong capacity-per-FLOP tradeoff (Mixtral).
  • If long context is a feature, plan continual pre-training with long sequences rather than retrofitting.

Step 5: Measure transfer, not just perplexity.
Evaluate on the task families you care about (reasoning, code, multilingual) at the base-model stage, before any alignment, and keep ablation hooks into your data slices.

Step 6: Treat repetition as a knob.
Intelligent replay (hard examples, diverse embeddings) beats blind extra epochs. It’s efficiency you can bank.


Case mini-studies (why these matter in practice)

A small model with big accuracy
You’re deploying on a single GPU and need crisp task-following and acceptable reasoning. Phi-2-style curation gives a 2–7B model surprising headroom, with fewer FLOPs and a friendlier memory footprint.

A long-context analyst
You want to ingest 100-page PDFs and chat across them. Continual long-sequence pre-training plus modest instruction tuning beats last-minute RoPE tweaks. Plan for it up front.

A multilingual/code assistant
You need language breadth and coding depth, but inference cost matters. Sparse MoE buys effective capacity without linear compute growth - great for bursty workloads.

Common misconceptions (and fixes)

“We’ll just fine-tune it later.”
You can steer tone later. You can’t conjure knowledge later. Invest in base quality.

“More tokens are always better.”
Not if they’re duplicates or low-value. Clean first; diversify second.

“Web data is too noisy to win.”
It used to be. With modern filtering and dedup, web-only corpora can compete with or beat curated blends at scale.

“Long context is an inference hack.”
It’s a training decision. You earn it during pre-training.

If you’re building or buying an LLM, interrogate the pre-training story first.
Ask about corpus composition, dedup and filtering, repetition policy, sequence-length curriculum, and architecture.

Then, and only then, worry about fine-tuning.

Sources