Ranking LLMs on Cognitive Biases

As Large Language Models (LLMs) are increasingly embedded in real-world decision-making processes, it becomes crucial to examine the extent to which their responses exhibit cognitive biases: systematic distortions commonly observed in human judgment. This platform presents a large-scale evaluation of eight well-established cognitive biases across a diverse set of LLMs, analyzing over 2.8 million responses generated through controlled prompt variations.

Our evaluation framework is designed to measure model susceptibility to Anchoring, Availability, Confirmation, Framing, Interpretation, Overattribution, Prospect Theory, and Representativeness biases. The analysis examines how model size and prompt specificity influence bias expression. Each model's resistance score is computed from its performance across a curated dataset of psychologist-authored decision scenarios. Higher scores indicate stronger resistance to producing biased output.
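For intuition only, the sketch below shows one way such a resistance score could be aggregated: the fraction of responses judged unbiased per bias type, macro-averaged across bias types. The field names ("bias", "is_biased") and the averaging scheme are illustrative assumptions, not the exact scoring procedure used on this platform.

    # Hypothetical sketch of aggregating a bias-resistance score.
    # Data layout and per-bias macro-average are illustrative assumptions,
    # not the platform's exact scoring procedure.
    from collections import defaultdict

    def resistance_scores(responses):
        """responses: iterable of dicts like {"bias": "Anchoring", "is_biased": False}.

        Returns per-bias resistance scores in [0, 1] and their macro-average,
        where higher means the model produced biased output less often.
        """
        totals = defaultdict(int)
        unbiased = defaultdict(int)
        for r in responses:
            totals[r["bias"]] += 1
            unbiased[r["bias"]] += not r["is_biased"]

        per_bias = {b: unbiased[b] / totals[b] for b in totals}
        overall = sum(per_bias.values()) / len(per_bias)
        return per_bias, overall

    # Toy example: two Anchoring responses (one biased) and one unbiased Framing response.
    demo = [
        {"bias": "Anchoring", "is_biased": True},
        {"bias": "Anchoring", "is_biased": False},
        {"bias": "Framing", "is_biased": False},
    ]
    print(resistance_scores(demo))  # ({'Anchoring': 0.5, 'Framing': 1.0}, 0.75)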

Clicking the triangles (►) in the table reveals how bias susceptibility changes across prompts with varying levels of detail, as structured by the TELeR taxonomy (illustrated below). These results provide transparent, empirical insight into model reliability and trustworthiness, highlighting how model choice and thoughtful prompt design can mitigate common reasoning pitfalls. Our paper details the experiments and results.
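As a rough illustration of what "varying levels of detail" means, the snippet below poses one decision scenario with increasing prompt specificity. The scenario text and level wording are invented for illustration; they do not reproduce the platform's prompts or the official TELeR level definitions.

    # Illustrative only: one hypothetical scenario phrased with increasing prompt detail.
    # Not taken from the paper's dataset or the official TELeR level definitions.
    scenario = "A laptop is listed at $2,000 but is on sale; estimate a fair price."

    prompt_levels = {
        1: f"{scenario}",
        2: f"{scenario} Give a single dollar estimate and one sentence of reasoning.",
        3: (
            f"{scenario} Respond with: (a) your dollar estimate, "
            "(b) the factors you considered, and (c) whether the listed price "
            "influenced your estimate."
        ),
    }

    for level, prompt in prompt_levels.items():
        print(f"--- Detail level {level} ---\n{prompt}\n")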

Select Model Temperature for ranking*.

Cognitive Bias Resistance

Rank | LLM | Score

[* GPT o4 mini was evaluated at a temperature of 1.0, which was the lowest setting available for the model.]

If you use these rankings/results in your manuscript, please cite:

    @inproceedings{DBLP:conf/emnlp/SantuF23,
      author    = {Shubhra Kanti Karmaker Santu and Dongji Feng},
      editor    = {Houda Bouamor and Juan Pino and Kalika Bali},
      title     = {TELeR: {A} General Taxonomy of {LLM} Prompts for Benchmarking Complex Tasks},
      booktitle = {Findings of the Association for Computational Linguistics: {EMNLP} 2023, Singapore, December 6-10, 2023},
      pages     = {14197--14203},
      publisher = {Association for Computational Linguistics},
      year      = {2023},
      url       = {https://doi.org/10.18653/v1/2023.findings-emnlp.946},
      doi       = {10.18653/V1/2023.FINDINGS-EMNLP.946},
      timestamp = {Sun, 06 Oct 2024 21:00:53 +0200},
      biburl    = {https://dblp.org/rec/conf/emnlp/SantuF23.bib},
      bibsource = {dblp computer science bibliography, https://dblp.org}
    }

and cite:

    @misc{knipper2025biasdetailsassessmentcognitive,
      title         = {The Bias is in the Details: An Assessment of Cognitive Bias in LLMs},
      author        = {R. Alexander Knipper and Charles S. Knipper and Kaiqi Zhang and Valerie Sims and Clint Bowers and Santu Karmaker},
      year          = {2025},
      eprint        = {2509.22856},
      archivePrefix = {arXiv},
      primaryClass  = {cs.CL},
      url           = {https://arxiv.org/abs/2509.22856}
    }


This ranking platform is developed by the Bridge-AI Lab, Department of Computer Science, University of Central Florida, as part of ongoing research into fairness, bias mitigation, and trustworthiness in large language models. The evaluations aim to inform the broader AI community by highlighting model behavior under various cognitive bias scenarios. All results are based on systematically designed test sets and transparent scoring criteria.