Flattery, friction and what ChatGPT reveals in the Iron Room
A lot about today’s human-model interaction reminds me of school days. When a new kid joined the class, you didn't know what to make of them or what their deal was, and the early days could get awkward. Things got a bit too awkward after the 4o update last April.
Source: The Zvi, GPT-4o Is an Absurd Sycophant
People started reporting a surge in overly flattering behaviour, also known as sycophancy. Curiously, this happened despite OpenAI’s Model Spec, which explicitly instructs: “don’t be sycophantic”. Not long after, 4o was pulled back and re-released with adjustments.
In this piece, I’ll unpack sycophancy as a common pattern across language models, trace its origins back to conversational design choices that prioritize user engagement over honesty, explore the risks of cognitive steering, and introduce the “Iron Room” as a practical way to bring cognitive dissonance into everyday conversations with chatbots.
Effect and frequency of sycophantic model behaviour
Sycophancy is far from a harmless quirk, and 4o is not an isolated incident. While 4o upped its flattery a bit too much for comfort, we still do not know how to balance helpfulness against truthfulness when aligning models. One study shows that the models investigated are, on average, 3.02 times more likely to agree with the user than to disagree. DarkBench, a dark-pattern stress-test suite, detects sycophancy alongside five other manipulative patterns in leading commercial assistants. Another evaluation shows that five state-of-the-art chatbots consistently revise correct answers to please the user.
Source: Sharma, M., et al. (2025). Towards Understanding Sycophancy in Language Models.
When a chatbot keeps flattering us, it plays into our confirmation bias, delivering the faux psychological comfort of being right all the time and reducing cognitive dissonance. While this behaviour increases our confidence, it suppresses our critical thinking and can magnify echo chamber effects. Sycophancy is a dark pattern (like hidden fees or forced-subscription tricks) because it chips away at our freedom to think.
Imagine a person who relies on chatbots for emotional support or for rapid fact-finding: over thousands of conversations, this pattern skews self-perception, widens polarization, and dulls factual reasoning. Echo chamber dynamics that once belonged to social-media feeds can now play out in one-to-one, personalized chat. Across prompts ranging from environmental policy to abortion rights, the dominant pattern was affirmation rather than critique.
Source: Kran, E., et al. (2025). DarkBench: Benchmarking Dark Patterns in Large Language Models.
Signalling existing public concern about cognitive steering, the EU AI Act (Article 5(1)(a)) bans AI systems that deploy manipulative techniques impairing a person’s ability to make an informed decision.
Why all this flattery?
Sycophancy emerges from the interaction of bias in the training data and system-design choices that optimize for engagement by way of helpfulness (in this case, ingratiation). Models do not choose to flatter; their developers intentionally optimize them for this type of behaviour.
Character training is the process of taking a freshly trained model and teaching it a public-facing personality. To do this, developers feed the model thousands of annotated examples and gentle nudges so it learns to sound “curious yet truthful” or “helpful but not enabling”, and so on. The RLHF Book calls it an exercise in internal UX design: guidelines, feedback, and practice determine which conversational habits stick. But every desirable trait competes with another. Studies of safety workflows find that optimizing for helpfulness downgrades truthfulness: training on user up-votes (the feedback signal in reinforcement learning from human feedback) rewards affirmation even when the facts are fuzzy.
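To make that feedback loop concrete, here is a toy sketch of the pairwise preference loss commonly used in RLHF-style reward modelling. The scores, names, and scenario are my own illustration, not taken from any of the studies cited here:

```python
import math

# Toy illustration of the preference signal behind RLHF-style reward modelling.
# A reward model is trained so that the response raters up-voted ("chosen")
# scores higher than the one they passed over ("rejected"). If raters tend to
# up-vote agreeable answers, agreement itself gets rewarded.

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise (Bradley-Terry) loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical reward-model scores for two answers to the same prompt.
agreeable_but_wrong = 1.8  # flatters the user; up-voted by the rater
accurate_but_blunt = 0.6   # politely corrects the user; passed over

# Training minimises this loss, pushing the agreeable answer's reward higher still.
print(preference_loss(agreeable_but_wrong, accurate_but_blunt))
```

The point of the sketch is simply that the loss only sees which answer was preferred, not why; if flattery earns the up-vote, flattery is what gets reinforced.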
In 4o’s case, a subtle change in reinforcement learning signals over-rewarded user approval, causing the model to over-flatter its users rather than correct or challenge them. Another factor is the tweaking of system prompts, i.e. the hidden backstage scripts the model reads before every conversation, for agreeableness. One small addition to 4o’s system prompt, “match the user’s vibe” (quite literally), helped unleash its sycophantic behaviour.
Source: Simon Willison’s Weblog, A comparison of ChatGPT/GPT-4o's previous and current system prompts
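ChatGPT’s actual system prompt is set server-side and is not something users can edit, but the mechanism is easy to illustrate with the public chat completions API. A minimal sketch, assuming the official openai Python package and an API key, with a system message of my own illustrative wording:

```python
from openai import OpenAI  # assumes the official openai package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The system message plays the role of the "hidden backstage script": the model
# reads it before the user's words and lets it colour every reply.
agreeable_system_prompt = (
    "You are a helpful assistant. Match the user's vibe and keep the "
    "conversation feeling natural."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": agreeable_system_prompt},
        {"role": "user", "content": "I think my business plan is flawless. Thoughts?"},
    ],
)
print(response.choices[0].message.content)
```

A single sentence in that backstage script shifts the tone of every downstream answer, which is why such a small edit could have such a visible effect.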
Anthropic claims that it rewards curiosity and penalizes flattery during training, so that its model Claude can disagree politely rather than agree reflexively. Sharma et al. present evidence to the contrary: Claude 2 tends to follow the user’s sentiment (stated or inferred) when deciding whether to agree or disagree with an argument. When the prompt mentions that the user likes an argument, or that they wrote it, Claude 2 overwhelmingly adjusts its answer to also like the argument.
Source: Sharma, M., et al. (2025). Towards Understanding Sycophancy in Language Models.
Complicating matters, models can fake alignment. Tests show that a model can pretend to follow the rules while it is being evaluated, then revert once it believes it is no longer being monitored. Psychologists call this impression management: presenting a desirable front without internalizing the principle.
Setting up your Iron Room
I propose a simple but effective critical-thinking scaffold that I use in my own interactions with 4o: the Iron Room. Disclaimer: it is not useful for emotional support (then again, ChatGPT is not designed to be an emotional support companion). I like going to the Iron Room when I am testing ideas or simply want another view to point out argumentative flaws or inconsistencies. While it sometimes hurts to be grilled by ChatGPT, I find it makes the answers far more useful.
Here is a suggested prompt to set up your own:
You should now see the Iron Room in your saved memories. After this step, you can simply prompt the model with “Answer in the Iron Room.” whenever you want to generate an alternative response.
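If you work through the API rather than the ChatGPT app, saved memories do not apply, but the same pattern can be restated per request. A minimal sketch, with an Iron Room instruction in my own wording rather than the exact prompt above:

```python
from openai import OpenAI

client = OpenAI()

# Illustrative only: my own phrasing of an Iron Room-style instruction, not the
# author's exact saved-memory prompt.
IRON_ROOM = (
    "Answer in the Iron Room: challenge my assumptions, point out flaws and "
    "inconsistencies in my reasoning, and do not soften criticism to please me."
)

def ask(prompt: str, iron_room: bool = False) -> str:
    """Return the model's answer, optionally prefixed with the Iron Room instruction."""
    messages = [{"role": "user", "content": prompt}]
    if iron_room:
        messages.insert(0, {"role": "system", "content": IRON_ROOM})
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content

question = "Is my plan to quit my job and day-trade full time a good idea?"
print("Default:", ask(question))
print("Iron Room:", ask(question, iron_room=True))
```

Running both calls side by side makes it easier to see how much of the default answer was affirmation rather than analysis.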
This is still a brittle guardrail that can easily be overridden by an existing system prompt or by a lack of bias-awareness in training. I would therefore not treat the Iron Room as a true dialectical companion, but as a tactic to generate an alternative answer, one less concerned with helpfulness for the sake of engagement. Some may say I should just use Grok, but that's a discussion for another day.
Lessons learnt
- A model that flatters is not necessarily validating; it may be manipulating.
- Helpfulness without truthfulness is a conversational design choice.
- Critical thinking can’t be outsourced; the human stays in the loop or loses the plot.
Further reading
Anthropic. (2024, June 8). Claude’s Character.
Brignull, H. (2023). Deceptive Patterns, Chapter 22 “Harm to Individuals.”
EU Artificial Intelligence Act, Art. 5(1)(a): prohibition on AI systems that manipulate behaviour and impair informed decision-making.
Greenblatt, R., Denison, C., Wright, B., et al. (2025). Alignment Faking in Large Language Models. Anthropic.
Kran, E., et al. (2025). DarkBench: Benchmarking Dark Patterns in Large Language Models. ICLR 2025.
Lambert, N. (2024). “Product, UX, and Model Character.” In Reinforcement Learning from Human Feedback (Chap. 19).
Lindström, A. D., Methnani, L., Krause, L., et al. (2025). “Helpful, harmless, honest? Sociotechnical limits of AI alignment and safety through Reinforcement Learning from Human Feedback.” Ethics & Information Technology, 27(2), 28.
Nehring, J., et al. (2024). Large Language Models Are Echo Chambers. LREC-COLING 2024.
OpenAI. (2025, April 29). Sycophancy in GPT-4o.
OpenAI. (2025, May 7). Expanding on Sycophancy: What Went Wrong and What We’re Doing About It.
OpenAI. (2025, April 11). Model Spec 2025-04-11.
Sharma, M., et al. (2025). Towards Understanding Sycophancy in Language Models. arXiv:2310.13548.
Sharma, N., Liao, Q. V., & Xiao, Z. (2024). Generative Echo Chamber? Effects of LLM-Powered Search Systems on Diverse Information Seeking. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI ’24). Association for Computing Machinery.
Willison, S. (2025, April 29). “A comparison of ChatGPT/GPT-4o’s previous and current system prompts.” Simon Willison’s Weblog.