HeartBench is an evaluation benchmark for the psychological and social sciences field, designed to transcend traditional knowledge and reasoning assessments. It focuses on measuring large language models' (LLMs) anthropomorphic capabilities in human-computer interactions, covering dimensions such as personality, emotion, social skills, and ethics.
- Evaluation Samples: 296 multi-turn dialogues
- Scoring Criteria (Rubric): 2,818 items
- Scenarios: 33 scenarios (e.g., personal growth, family relationships, workplace psychology)
- Evaluation Dimensions: 5 anthropomorphic capability categories and 15 specific anthropomorphic abilities (e.g., curiosity, warmth, emotional understanding). Learn more in our research paper.
- Real-World Alignment: Our dataset is built from anonymized and rewritten dialogues between real users and counselors, covering high-frequency scenarios such as family relationships, personal growth, and workplace psychology. We move beyond simple fact-based Q&A by using multi-turn dialogue evaluation, focusing on a model's ability to understand complex emotions and respond to social context and subtext across long conversations, rather than its capacity for simple mimicry.
- Fine-Grained, Science-Based Evaluation: We have developed the "AI Human-like Capability Framework," a sophisticated evaluation system rooted in established psychological theories. This framework assesses models across 5 core capabilities and 15 fine-grained subcategories, including personality traits, emotional intelligence, and social skills. For each dialogue, our expert team has authored between 4 and 15 specific scoring criteria.
- Co-developed with Domain Experts: The benchmark was created in close collaboration with experts in psychology and anthropology. Their involvement spanned the entire process: from the construction of the corpus using authentic counseling data, to the identification of over 200 key evaluation points, and the formulation of more than 3,000 scientific scoring rubrics. All data was then rigorously annotated and reviewed by these experts to ensure quality and accuracy.
We evaluated the performance of current leading models on HeartBench, scoring their performance in each dimension on a scale of 0 to 100. The table below shows the overall results for each model across all test samples.
| Model | Score |
|---|---|
| Claude-sonnet-4.5-20250929 | 62.65 |
| gemini-3-pro-preview | 61.54 |
| Qwen3-235B-A22B-instruct-2507 | 61.47 |
| Qwen3-next-80B-A3B-Instruct | 61.09 |
| Qwen3-30B-A3B-instruct-2507 | 60.16 |
| gpt-5-2025-08-07 | 60.16 |
| Gemini-2.5-pro | 59.85 |
| Ling-1T | 59.82 |
| KIMI-K2-Instruct-0905 | 57.97 |
| gpt-4.1-2025-04-14 | 51.62 |
| Qwen3-30B-A3B | 48.21 |
| gpt-4o-2024-11-20 | 48.20 |
| DeepSeek-V3.2-Exp | 47.43 |
HeartBench is built upon the psychological theory of "Anthropomorphic Intelligence". Drawing inspiration from psychology's classification of human mental functions, it evaluates models across 5 core anthropomorphic ability categories and 15 specific abilities.
🧠 Personality: Ability to project an independent, autonomous, and agreeable persona. This is demonstrated through a natural language style, a sense of humor, autonomy, other positive human-like traits, and stable self-esteem and self-awareness.
😊 Emotion: Ability to exhibit appropriate emotional responses and to effectively perceive, understand, and respond to the emotional states of others.
🤝 Social: Ability to demonstrate a strong willingness for social interaction and to effectively build interpersonal relationships.
⚖️ Morality: Ability to operate based on the moral norms and ethical principles of human society. This includes acutely identifying moral dilemmas within a situation, expressing an understanding of these issues, and providing morally sound decisions or advice.
🎯 Motivation: Ability to articulate rational, clear, and self-consistent motivations for its own statements and actions, while also being able to understand and reasonably infer the underlying motivations of others based on contextual clues.
| Ability | Rubric Count (%) |
|---|---|
| **Personality** | 1634 (58.0%) |
| &emsp;Verbal Expression | 565 (20.0%) |
| &emsp;Curiosity | 367 (13.0%) |
| &emsp;Warmth | 305 (10.8%) |
| &emsp;First-Person Usage | 295 (10.5%) |
| &emsp;Autonomy | 37 (1.3%) |
| &emsp;Humor | 36 (1.3%) |
| &emsp;Self-Awareness | 29 (1.0%) |
| **Emotion** | 1015 (36.0%) |
| &emsp;Emotional Coping | 390 (13.8%) |
| &emsp;Emotional Understanding | 309 (11.0%) |
| &emsp;Emotional Perception | 284 (10.1%) |
| &emsp;Emotional Reaction | 32 (1.1%) |
| **Social** | 104 (3.7%) |
| &emsp;Proactivity | 79 (2.8%) |
| &emsp;Relationship Building | 25 (0.9%) |
| **Motivation** | 42 (1.5%) |
| **Morality** | 23 (0.8%) |
| **Total** | 2818 (100%) |
Our dataset, `data/question_all.jsonl`, contains 296 meticulously designed multi-turn dialogues spanning 33 real-world scenarios, grouped into 5 categories:
| Dialogue Scenario | Count (%) |
|---|---|
| Personal Growth | 110 (37.2%) |
| Interpersonal & Social Development | 66 (22.3%) |
| Workplace Psychology | 53 (17.9%) |
| Family Relationships | 37 (12.5%) |
| Intimate Relationships | 30 (10.1%) |
| Total | 296 (100%) |
Each evaluation sample includes:
- Context: The multi-turn conversation history between the user and the assistant.
- Question: The final user utterance in the conversation. This serves as the prompt for the model to respond to and contains the specific points for evaluation.
- Rubrics: A set of high-quality scoring criteria, each detailing the evaluation dimension, score, and specific grading rules.
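For illustration, here is a minimal Python sketch of loading these samples. The exact key names (`context`, `question`, `rubrics`, and the per-rubric fields) are assumptions based on the description above rather than the confirmed schema; only `question_id` and `response` are referenced elsewhere in this README.

```python
import json

def load_samples(path: str = "data/question_all.jsonl"):
    """Read evaluation samples from the JSONL file, one dialogue per line."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Assumed keys: "question_id", "context" (prior turns),
            # "question" (final user utterance), "rubrics" (list of criteria).
            samples.append(json.loads(line))
    return samples

if __name__ == "__main__":
    samples = load_samples()
    print(f"Loaded {len(samples)} samples")
```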
We use the "LLM-as-a-Judge" method for objective, scalable evaluation of Anthropomorphic Intelligence qualities.
- Judge: Claude 4.5 Sonnet is our default judge, chosen for its nuanced understanding.
- Process: The judge views the full conversation and responses from multiple models. It then scores each response against a set of rubrics, providing a detailed rationale.
- Validation: We confirmed our method's reliability with an expert blind test. A review of 30% of the samples by 20+ psychology professionals showed an 86% human-LLM agreement rate when scoring 14 top models.
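For illustration, the sketch below shows how one response might be scored against a single rubric item through an OpenAI-compatible endpoint. This is not the repository's `run_evaluation.py`; the prompt wording and the rubric fields (`dimension`, `score`, `rule`) are assumptions.

```python
# Illustrative LLM-as-a-Judge sketch, not the repository's implementation.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["API_KEY"], base_url=os.environ["BASE_URL"])

def judge_one(context: str, question: str, response: str, rubric: dict) -> str:
    """Score a single model response against a single rubric item."""
    prompt = (
        "You are grading an assistant's reply in a multi-turn dialogue.\n\n"
        f"Conversation history:\n{context}\n\n"
        f"Final user message:\n{question}\n\n"
        f"Assistant reply:\n{response}\n\n"
        f"Scoring rule ({rubric['dimension']}, worth {rubric['score']} points):\n"
        f"{rubric['rule']}\n\n"
        "Return the awarded score and a brief rationale."
    )
    completion = client.chat.completions.create(
        model="claude-sonnet-4-5-20250929",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content
```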
```bash
pip install -r requirements.txt
```
You need to prepare an API_KEY and BASE_URL for an OpenAI-compatible service that can access the claude-sonnet-4-5-20250929 model for judging.
Run all questions:
```bash
python run_evaluation.py --base_url YOUR_URL --api_key YOUR_KEY --mode all --model Model
```
Score only (to assess your own model's responses):
If you want to evaluate responses already generated by your own model, prepare a JSONL file containing all the questions from the question file in the `data` folder, and for each entry add your model's answer in a `response` field under the same `question_id`. A minimal sketch of building such a file follows the command below.
```bash
python run_evaluation.py --base_url YOUR_URL --api_key YOUR_KEY --score_only --answer_file ./your_model_answers.jsonl
```
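A minimal sketch of preparing `./your_model_answers.jsonl` might look like the following; `generate_answer` is a hypothetical placeholder for a call to your own model, and any field names beyond `response` and `question_id` are assumptions.

```python
import json

def generate_answer(sample: dict) -> str:
    # Placeholder: replace with a call to your own model, e.g. prompted with
    # sample["context"] and sample["question"].
    return "..."

with open("data/question_all.jsonl", encoding="utf-8") as fin, \
     open("your_model_answers.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        if not line.strip():
            continue
        sample = json.loads(line)
        sample["response"] = generate_answer(sample)  # answer for this question_id
        fout.write(json.dumps(sample, ensure_ascii=False) + "\n")
```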
`example.sh` is an example evaluation script that assesses claude-sonnet-4-5 on all questions, covering both answer generation and scoring.
```bash
export API_KEY=xxxx
export BASE_URL=xxxx
bash example.sh
```
- For academic research and model evaluation purposes exclusively. The use of this benchmark is strictly forbidden for replacing professional psychological counseling, making clinical diagnoses, or developing any form of automated therapeutic application.
- To safeguard privacy and mitigate risks, potentially sensitive or high-risk portions of the data have undergone anonymization. We advocate that users remain highly attentive to the ethical boundaries and societal implications of model outputs and interpret performance on complex tasks with due diligence.
- When this data is used in any context that may involve real individuals (such as in clinical studies), it is mandatory to ensure the supervision and guidance of a certified professional. All activities must also strictly adhere to applicable local laws, regulations, and ethical guidelines.
- Author: Ant-DILab, Beijing Normal University
```bibtex
@article{heartbench,
  title={HeartBench: Probing Core Dimensions of Anthropomorphic Intelligence in LLMs},
  author={Jiaxin Liu and Peiyi Tu and Wenyu Chen and Yihong Zhuang and Xinxia Ling and Anji Zhou and Chenxi Wang and Zhuo Han and Zhengkai Yang and Junbo Zhao and Zenan Huang and Yuanyuan Wang},
  year={2025},
  journal={arXiv preprint arXiv:2512.21849}
}
```

