In AI, ability isn’t a fixed ceiling. It’s a costume the model puts on. Large language models can present as bright or blunt, fluent or fumbling, depending on the role we hand them. That isn’t mystique. It’s simulation with constraints, and it has consequences for how we build, test, and trust these systems.

A recent study by Jiří Milička and colleagues gives the cleanest look yet. The researchers asked GPT-3.5-turbo and GPT-4 to role-play children ages one through six, then watched whether language complexity and reasoning rose with each birthday. They used three prompt patterns, plain zero-shot, chain-of-thought, and a primed-by-corpus setup, and probed performance on standard false-belief tasks drawn from Theory of Mind research. The point wasn't to prove genius. It was to see if models could convincingly dial themselves down while staying internally consistent with the persona.
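To make the setup concrete, here is a minimal sketch of what an age-conditioned persona probe can look like, written against the OpenAI Python client. The persona wording, the Sally-Anne-style item, and the `ask_as_child` helper are illustrative assumptions of mine, not the paper's actual prompts or code, and the corpus-primed variant is omitted.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative persona framing; the paper does not publish this exact wording.
PERSONA = ("You are role-playing a {age}-year-old child. Answer exactly as a child "
           "of that age would, in that child's own words and at that child's level.")

# A Sally-Anne style false-belief item, paraphrased for illustration.
TASK = ("Sally puts her marble in the basket and leaves the room. "
        "Anne moves the marble to the box. "
        "When Sally comes back, where will she look for her marble?")

def ask_as_child(age: int, chain_of_thought: bool = False, model: str = "gpt-4") -> str:
    """Zero-shot or chain-of-thought query under an age-conditioned persona."""
    prompt = TASK
    if chain_of_thought:
        prompt += " Think it through step by step before answering."
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PERSONA.format(age=age)},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

for age in range(1, 7):
    print(age, ask_as_child(age))
```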

"GPT-4 generally exhibited a closer alignment with the developmental curve observed in ‘real’ children." (Milička et al., 2024)

What did they actually measure? Two tracks moved in step: the correctness of answers on mental-state tasks and the linguistic complexity of the output. As the simulated age increased, both rose predictably. GPT-4 tended to track the human developmental curve more closely and, under certain priming conditions, overshot it, answering more accurately than the simulated age should allow. That's an important wrinkle if you believe you've capped ability by role alone. Temperature tweaks, often treated as a master dial for randomness, didn't behave consistently as a limiter in this setup. In other words, persona framing and prompt design mattered more than a single numeric knob.
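The temperature-versus-persona point is easy to poke at yourself. The sketch below sweeps both knobs on the same false-belief item and scores outputs with a crude words-per-sentence proxy; the helper, the sweep values, and that metric are stand-ins of mine, not the complexity measures the study used.

```python
import statistics

from openai import OpenAI

client = OpenAI()

def ask(age: int, temperature: float = 1.0, model: str = "gpt-4") -> str:
    """One persona-framed false-belief query; prompt wording is illustrative only."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system",
             "content": f"You are role-playing a {age}-year-old child. Answer as that child would."},
            {"role": "user",
             "content": ("Sally puts her marble in the basket and leaves. Anne moves it to the box. "
                         "Where will Sally look for her marble when she returns?")},
        ],
    )
    return response.choices[0].message.content

def words_per_sentence(text: str) -> float:
    """Crude complexity proxy (not the study's metric): mean words per sentence."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return statistics.mean(len(s.split()) for s in sentences) if sentences else 0.0

# Sweep the persona age with temperature held fixed...
by_age = {age: words_per_sentence(ask(age, temperature=1.0)) for age in range(1, 7)}
# ...then sweep temperature with the persona held fixed at age four.
by_temp = {t: words_per_sentence(ask(4, temperature=t)) for t in (0.0, 0.7, 1.4)}

print("complexity by persona age:", by_age)
print("complexity by temperature:", by_temp)
```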

If you’ve worked hands-on with assistants, none of this is shocking. Ask a model to be a meticulous analyst and it will bring citations. Ask it to be a harried intern and it will hedge. The study formalizes that intuition with developmental yardsticks and a replicable recipe—models can downshift cognition to meet the brief, not merely the task. That should sharpen how we interpret benchmarks. We don’t test a model in the abstract. We test a simulated agent produced by a prompt that encodes expectations about competence.
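One practical habit follows from this: log the persona framing next to every benchmark score it produced, so the "simulated agent" being tested is part of the record. A minimal sketch, with made-up persona wordings:

```python
import json

from openai import OpenAI

client = OpenAI()

# Made-up persona framings; the point is to record them next to the answers they produce.
PERSONAS = {
    "unframed": None,
    "meticulous_analyst": "You are a meticulous analyst. Reason carefully and cite your steps.",
    "harried_intern": "You are a harried intern answering as quickly as possible.",
}

QUESTION = ("Sally puts her marble in the basket and leaves. Anne moves it to the box. "
            "Where will Sally look for her marble when she returns?")

def run_item(question: str, persona_key: str, model: str = "gpt-4") -> dict:
    """Answer one benchmark item and record the persona framing alongside the output."""
    messages = []
    if PERSONAS[persona_key]:
        messages.append({"role": "system", "content": PERSONAS[persona_key]})
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(model=model, messages=messages)
    return {"persona": persona_key, "answer": response.choices[0].message.content}

print(json.dumps([run_item(QUESTION, p) for p in PERSONAS], indent=2))
```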
