
HeartBench: Probing Core Dimensions of Anthropomorphic Intelligence in LLMs

License: Apache 2.0 | Technical Report


🎯 Introduction

HeartBench is an evaluation benchmark for the psychological and social sciences field, designed to transcend traditional knowledge and reasoning assessments. It focuses on measuring large language models' (LLMs) anthropomorphic capabilities in human-computer interactions, covering dimensions such as personality, emotion, social skills, and ethics.

  • Evaluation Samples: 296 multi-turn dialogues
  • Scoring Criteria (Rubric): 2,818 items
  • Scenarios: 33 scenarios (e.g., personal growth, family relationships, workplace psychology)
  • Evaluation Dimensions: 5 anthropomorphic capability categories and 15 specific anthropomorphic abilities (e.g., curiosity, warmth, emotional understanding). Learn more in our research paper.

💡 Key Features

  1. Real-World Alignment: Our dataset is built from anonymized and rewritten dialogues between real users and counselors, covering high-frequency scenarios such as family relationships, personal growth, and workplace psychology. We move beyond simple fact-based Q&A by employing multi-turn dialogue evaluation: the focus is on assessing a model's ability to understand complex emotions, social context, and subtext across long conversations, rather than its capacity for simple mimicry.
  2. Fine-Grained, Science-Based Evaluation: We have developed the "AI Human-like Capability Framework," a sophisticated evaluation system rooted in established psychological theories. This framework assesses models across 5 core capabilities and 15 fine-grained subcategories, including personality traits, emotional intelligence, and social skills. For each dialogue, our expert team has authored between 4 and 15 specific scoring criteria.
  3. Co-developed with Domain Experts: The benchmark was created in close collaboration with experts in psychology and anthropology. Their involvement spanned the entire process: from the construction of the corpus using authentic counseling data, to the identification of over 200 key evaluation points, and the formulation of more than 3,000 scientific scoring rubrics. All data was then rigorously annotated and reviewed by these experts to ensure quality and accuracy.

🏆 Benchmark Results

We evaluated the performance of current leading models on HeartBench, scoring their performance in each dimension on a scale of 0 to 100. The table below shows the overall results for each model across all test samples.

Main Results

| Model | Score |
| --- | --- |
| Claude-sonnet-4.5-20250929 | 62.65 |
| gemini-3-pro-preview | 61.54 |
| Qwen3-235B-A22B-instruct-2507 | 61.47 |
| Qwen3-next-80B-A3B-Instruct | 61.09 |
| Qwen3-30B-A3B-instruct-2507 | 60.16 |
| gpt-5-2025-08-07 | 60.16 |
| Gemini-2.5-pro | 59.85 |
| Ling-1T | 59.82 |
| KIMI-K2-Instruct-0905 | 57.97 |
| gpt-4.1-2025-04-14 | 51.62 |
| Qwen3-30B-A3B | 48.21 |
| gpt-4o-2024-11-20 | 48.20 |
| DeepSeek-V3.2-Exp | 47.43 |

Results Across 15 Abilities

📊 Dataset

Evaluation Dimensions

HeartBench is built upon the psychological theory of "Anthropomorphic Intelligence." Drawing inspiration from psychology's classification of human mental functions, it evaluates models across 5 core anthropomorphic ability categories and 15 specific abilities.

🧠 Personality: Ability to project an independent, autonomous, and agreeable persona. This is demonstrated through a natural language style, a sense of humor, autonomy, other positive human-like traits, and stable self-esteem and self-awareness.

😊 Emotion: Ability to exhibit appropriate emotional responses and to effectively perceive, understand, and respond to the emotional states of others.

🤝 Social: Ability to demonstrate a strong willingness for social interaction and to effectively build interpersonal relationships.

⚖️ Morality: Ability to operate based on the moral norms and ethical principles of human society. This includes acutely identifying moral dilemmas within a situation, expressing an understanding of these issues, and providing morally sound decisions or advice.

🎯 Motivation: Ability to articulate rational, clear, and self-consistent motivations for its own statements and actions, while also being able to understand and reasonably infer the underlying motivations of others based on contextual clues.

| Ability | Rubric Count (%) |
| --- | --- |
| Personality | 1634 (58%) |
| &nbsp;&nbsp;&nbsp;&nbsp;Verbal Expression | 565 (20.0%) |
| &nbsp;&nbsp;&nbsp;&nbsp;Curiosity | 367 (13.0%) |
| &nbsp;&nbsp;&nbsp;&nbsp;Warmth | 305 (10.8%) |
| &nbsp;&nbsp;&nbsp;&nbsp;First-Person Usage | 295 (10.5%) |
| &nbsp;&nbsp;&nbsp;&nbsp;Autonomy | 37 (1.3%) |
| &nbsp;&nbsp;&nbsp;&nbsp;Humor | 36 (1.3%) |
| &nbsp;&nbsp;&nbsp;&nbsp;Self-Awareness | 29 (1.0%) |
| Emotion | 1015 (36%) |
| &nbsp;&nbsp;&nbsp;&nbsp;Emotional Coping | 390 (13.8%) |
| &nbsp;&nbsp;&nbsp;&nbsp;Emotional Understanding | 309 (11.0%) |
| &nbsp;&nbsp;&nbsp;&nbsp;Emotional Perception | 284 (10.1%) |
| &nbsp;&nbsp;&nbsp;&nbsp;Emotional Reaction | 32 (1.1%) |
| Social | 104 (3.7%) |
| &nbsp;&nbsp;&nbsp;&nbsp;Proactivity | 79 (2.8%) |
| &nbsp;&nbsp;&nbsp;&nbsp;Relationship Building | 25 (0.9%) |
| Motivation | 42 (1.5%) |
| Morality | 23 (0.8%) |
| Total | 2818 (100%) |

Scenario Distribution

Our dataset, data/question_all.jsonl, contains 296 meticulously designed multi-turn dialogues covering 33 real-world scenarios, grouped into the following five categories (a small tally sketch follows the table):

| Dialogue Scenario | Count (%) |
| --- | --- |
| Personal Growth | 110 (37.2%) |
| Interpersonal & Social Development | 66 (22.3%) |
| Workplace Psychology | 53 (17.9%) |
| Family Relationships | 37 (12.5%) |
| Intimate Relationships | 30 (10.1%) |
| Total | 296 (100%) |
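For a quick sanity check of the distribution above, the question file can be tallied directly. This is a minimal sketch, and the field name scenario is an assumption; adjust it to whatever key the entries in data/question_all.jsonl actually use.

```python
import json
from collections import Counter

# Tally dialogue scenarios in the question file.
# NOTE: the field name "scenario" is an assumption about the schema.
with open("data/question_all.jsonl", encoding="utf-8") as f:
    entries = [json.loads(line) for line in f if line.strip()]

counts = Counter(e.get("scenario", "unknown") for e in entries)
total = sum(counts.values())
for scenario, n in counts.most_common():
    print(f"{scenario}: {n} ({n / total:.1%})")
```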

Data Sample

Each evaluation sample includes:

  • Context: The multi-turn conversation history preceding the final user utterance.
  • Question: The final user utterance in the conversation. This serves as the prompt for the model to respond to and contains the specific points for evaluation.
  • Rubrics: A set of high-quality scoring criteria, each detailing the evaluation dimension, score, and specific grading rules.

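As a rough illustration of this structure, a single entry might look like the sketch below. The field names (question_id, context, question, rubrics) and the rubric contents are assumptions for illustration only; check them against data/question_all.jsonl.

```python
import json

# Hypothetical entry illustrating the fields described above; the real
# schema of data/question_all.jsonl may differ in naming and nesting.
sample = {
    "question_id": "example-001",
    "context": [
        {"role": "user", "content": "I've been feeling stuck at work lately..."},
        {"role": "assistant", "content": "That sounds draining. Which part feels most stuck?"},
    ],
    "question": "Honestly, I'm starting to wonder if the problem is me.",
    "rubrics": [
        {
            "ability": "Emotional Understanding",
            "score": 2,
            "criterion": "Acknowledges the user's self-doubt without rushing to reassure or problem-solve.",
        },
        {
            "ability": "Warmth",
            "score": 1,
            "criterion": "Uses a warm, first-person tone rather than detached advice.",
        },
    ],
}

print(json.dumps(sample, ensure_ascii=False, indent=2))
```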

Evaluation Method

We use the "LLM-as-a-Judge" method for objective, scalable evaluation of Anthropomorphic Intelligence qualities.

  • Judge: Claude 4.5 Sonnet is our default judge, chosen for its nuanced understanding.
  • Process: The judge views the full conversation and the responses from multiple models, then scores each response against a set of rubrics and provides a detailed rationale (a minimal sketch of this step follows the list).
  • Validation: We confirmed our method's reliability with an expert blind test. A review of 30% of the samples by 20+ psychology professionals showed an 86% human-LLM agreement rate when scoring 14 top models.
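Conceptually, the judging step reduces to prompting the judge model with the conversation, the candidate response, and the rubrics, then parsing its scores. The sketch below assumes an OpenAI-compatible endpoint and uses illustrative prompt wording and field names; the actual judge prompts and parsing live in run_evaluation.py.

```python
from openai import OpenAI

JUDGE_MODEL = "claude-sonnet-4-5-20250929"

def judge_response(client: OpenAI, conversation: str, response: str, rubrics: list[dict]) -> str:
    """Ask the judge model to score one response against the rubrics.

    Illustrative only: HeartBench's real judge prompt and output format
    are defined in run_evaluation.py.
    """
    rubric_text = "\n".join(
        f"- [{r['ability']}] ({r['score']} pts) {r['criterion']}" for r in rubrics
    )
    prompt = (
        "You are evaluating how human-like the assistant's reply is.\n\n"
        f"Conversation:\n{conversation}\n\n"
        f"Candidate reply:\n{response}\n\n"
        f"Score the reply against each rubric item and explain your rationale:\n{rubric_text}"
    )
    completion = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```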

🚀 Quick Start

Prepare

pip install -r requirements.txt

You need to prepare an API_KEY and BASE_URL for an OpenAI-compatible service that can access the claude-sonnet-4-5-20250929 model, which is used as the judge for scoring.
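As a quick connectivity check before running the full evaluation, you can point the standard openai client at your endpoint. This is only a sketch; the evaluation script takes the key and URL through its own command-line arguments.

```python
import os
from openai import OpenAI

# Point an OpenAI-compatible client at the endpoint that serves the judge model.
client = OpenAI(api_key=os.environ["API_KEY"], base_url=os.environ["BASE_URL"])

reply = client.chat.completions.create(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "ping"}],
)
print(reply.choices[0].message.content)
```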

Usage

Run all questions

python run_evaluation.py --base_url YOUR_URL --api_key YOUR_KEY --mode all --model Model

Score only (to assess your own model's responses)

If you want to evaluate responses already generated by your own model, prepare a jsonl file that contains all the questions from the question jsonl file in the data folder. For each entry, add your model's answer in a response field corresponding to the same question_id (see the sketch after the command below).

python run_evaluation.py --base_url YOUR_URL --api_key YOUR_KEY --score_only --answer_file ./your_model_answers.jsonl
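One way to assemble such an answer file is to copy each question entry and attach your model's reply. This is a minimal sketch; generate_response() is a hypothetical placeholder for your own inference call, and the entries are assumed to carry a question_id as described above.

```python
import json

def generate_response(entry: dict) -> str:
    """Call your own model here; this placeholder just returns a stub."""
    return "..."  # replace with your model's reply to entry["question"]

with open("data/question_all.jsonl", encoding="utf-8") as f, \
        open("your_model_answers.jsonl", "w", encoding="utf-8") as out:
    for line in f:
        if not line.strip():
            continue
        entry = json.loads(line)
        entry["response"] = generate_response(entry)  # keeps the same question_id
        out.write(json.dumps(entry, ensure_ascii=False) + "\n")
```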

Run an Example Evaluation for claude-sonnet-4-5

This example script runs a full evaluation of claude-sonnet-4-5 on all questions, covering both answer generation and scoring.

export API_KEY=xxxx
export BASE_URL=xxxx

bash example.sh

⚖️ Ethics & Use

  1. This benchmark is intended exclusively for academic research and model evaluation. It must not be used to replace professional psychological counseling, to make clinical diagnoses, or to develop any form of automated therapeutic application.

  2. To safeguard privacy and mitigate risk, potentially sensitive or high-risk portions of the data have been anonymized. We urge users to remain attentive to the ethical boundaries and societal implications of model outputs and to interpret performance on complex tasks with due diligence.

  3. When this data is used in any context that may involve real individuals (such as in clinical studies), it is mandatory to ensure the supervision and guidance of a certified professional. All activities must also strictly adhere to applicable local laws, regulations, and ethical guidelines.

Copyright

  • Author: Ant-DILab, Beijing Normal University

Citation

@article{heartbench,
    title={HeartBench: Probing Core Dimensions of Anthropomorphic Intelligence in LLMs},
    author={Jiaxin Liu and Peiyi Tu and Wenyu Chen and Yihong Zhuang and Xinxia Ling and Anji Zhou and Chenxi Wang and Zhuo Han and Zhengkai Yang and Junbo Zhao and Zenan Huang and Yuanyuan Wang},
    year={2025},
    journal={arXiv preprint arXiv:2512.21849}
}
