In AI, ability isn’t a fixed ceiling. It’s a costume the model puts on. Large language models can present as bright or blunt, fluent or fumbling, depending on the role we hand them. That isn’t mystique. It’s simulation with constraints, and it has consequences for how we build, test, and trust these systems.

A recent study by Jiří Milička and colleagues gives the cleanest look yet. The researchers asked GPT-3.5-turbo and GPT-4 to role-play children ages one through six, then watched whether language complexity and reasoning rose with each birthday. They used three prompt patterns, plain zero-shot, chain-of-thought, and a primed-by-corpus setup, and probed performance on standard false-belief tasks drawn from Theory of Mind research. The point wasn't to prove genius. It was to see if models could convincingly dial themselves down while staying internally consistent with the persona.
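To make the setup concrete, here is a minimal sketch of what an age-conditioned persona probe can look like, written against the OpenAI Python client. The persona wording, the Sally-Anne-style item, and the `ask_as_child` helper are illustrative assumptions of mine, not the paper's actual prompts or code, and the corpus-primed variant is omitted.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative persona framing; the paper does not publish this exact wording.
PERSONA = ("You are role-playing a {age}-year-old child. Answer exactly as a child "
           "of that age would, in that child's own words and at that child's level.")

# A Sally-Anne style false-belief item, paraphrased for illustration.
TASK = ("Sally puts her marble in the basket and leaves the room. "
        "Anne moves the marble to the box. "
        "When Sally comes back, where will she look for her marble?")

def ask_as_child(age: int, chain_of_thought: bool = False, model: str = "gpt-4") -> str:
    """Zero-shot or chain-of-thought query under an age-conditioned persona."""
    prompt = TASK
    if chain_of_thought:
        prompt += " Think it through step by step before answering."
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PERSONA.format(age=age)},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

for age in range(1, 7):
    print(age, ask_as_child(age))
```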

"GPT-4 generally exhibited a closer alignment with the developmental curve observed in ‘real’ children." (Milička et al., 2024)

What did they actually measure? Two tracks moved in step: the correctness of answers on mental-state tasks and the linguistic complexity of the output. As the simulated age increased, both rose predictably. GPT-4 tended to track the human developmental curve more closely and, under certain priming conditions, overshot it, answering more accurately than the simulated age should allow. That's an important wrinkle if you believe you've capped ability by role alone. Temperature tweaks, often treated as a master dial for randomness, didn't behave consistently as a limiter in this setup. In other words, persona framing and prompt design mattered more than a single numeric knob.
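The temperature-versus-persona point is easy to poke at yourself. The sketch below sweeps both knobs on the same false-belief item and scores outputs with a crude words-per-sentence proxy; the helper, the sweep values, and that metric are stand-ins of mine, not the complexity measures the study used.

```python
import statistics

from openai import OpenAI

client = OpenAI()

def ask(age: int, temperature: float = 1.0, model: str = "gpt-4") -> str:
    """One persona-framed false-belief query; prompt wording is illustrative only."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system",
             "content": f"You are role-playing a {age}-year-old child. Answer as that child would."},
            {"role": "user",
             "content": ("Sally puts her marble in the basket and leaves. Anne moves it to the box. "
                         "Where will Sally look for her marble when she returns?")},
        ],
    )
    return response.choices[0].message.content

def words_per_sentence(text: str) -> float:
    """Crude complexity proxy (not the study's metric): mean words per sentence."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return statistics.mean(len(s.split()) for s in sentences) if sentences else 0.0

# Sweep the persona age with temperature held fixed...
by_age = {age: words_per_sentence(ask(age, temperature=1.0)) for age in range(1, 7)}
# ...then sweep temperature with the persona held fixed at age four.
by_temp = {t: words_per_sentence(ask(4, temperature=t)) for t in (0.0, 0.7, 1.4)}

print("complexity by persona age:", by_age)
print("complexity by temperature:", by_temp)
```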

If you’ve worked hands-on with assistants, none of this is shocking. Ask a model to be a meticulous analyst and it will bring citations. Ask it to be a harried intern and it will hedge. The study formalizes that intuition with developmental yardsticks and a replicable recipe—models can downshift cognition to meet the brief, not merely the task. That should sharpen how we interpret benchmarks. We don’t test a model in the abstract. We test a simulated agent produced by a prompt that encodes expectations about competence.
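One practical habit follows from this: log the persona framing next to every benchmark score it produced, so the "simulated agent" being tested is part of the record. A minimal sketch, with made-up persona wordings:

```python
import json

from openai import OpenAI

client = OpenAI()

# Made-up persona framings; the point is to record them next to the answers they produce.
PERSONAS = {
    "unframed": None,
    "meticulous_analyst": "You are a meticulous analyst. Reason carefully and cite your steps.",
    "harried_intern": "You are a harried intern answering as quickly as possible.",
}

QUESTION = ("Sally puts her marble in the basket and leaves. Anne moves it to the box. "
            "Where will Sally look for her marble when she returns?")

def run_item(question: str, persona_key: str, model: str = "gpt-4") -> dict:
    """Answer one benchmark item and record the persona framing alongside the output."""
    messages = []
    if PERSONAS[persona_key]:
        messages.append({"role": "system", "content": PERSONAS[persona_key]})
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(model=model, messages=messages)
    return {"persona": persona_key, "answer": response.choices[0].message.content}

print(json.dumps([run_item(QUESTION, p) for p in PERSONAS], indent=2))
```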
