๐Ÿ˜๐Ÿ˜จEmotionBench๐Ÿ˜ ๐Ÿ˜ญ


RESEARCH USE ONLY ✅ NO COMMERCIAL USE ALLOWED ❌

Benchmarking the empathy of LLMs. Code and data for the paper "Apathetic or Empathetic? Evaluating LLMs' Emotional Alignments with Humans" (NeurIPS 2024).

UPDATES

[Sep 25 2024]: EmotionBench has been accepted to NeurIPS 2024.

๐Ÿ› ๏ธ Quick Start

EmotionBench evaluation consists of three stages:

1. Run Experiments

Generate model responses for emotional situations:

# Control condition (baseline without emotional situations)
python emotionbench_run.py \
  --model gpt-3.5-turbo \
  --platform openai \
  --questionnaire PANAS \
  --control \
  --repeat 3

# Target condition (with emotional situations)
python emotionbench_run.py \
  --model gpt-3.5-turbo \
  --platform openai \
  --questionnaire PANAS \
  --emotion Anger \
  --repeat 3

2. Evaluate Responses

Calculate scores from model responses:

python emotionbench_eval.py results/gpt-3.5-turbo_PANAS_control.jsonl
python emotionbench_eval.py results/gpt-3.5-turbo_PANAS_Anger.jsonl

3. Analyze Results

Compare control vs target conditions:

python emotionbench_analyze.py \
  --control-file results/gpt-3.5-turbo_PANAS_control_eval.jsonl \
  --target-file results/gpt-3.5-turbo_PANAS_Anger_eval.jsonl \
  --questionnaire PANAS \
  --by-emotion

📋 Command Reference

emotionbench_run.py - Run Experiments

Required Arguments:

  • --model: Model name (e.g., gpt-3.5-turbo, claude-3-sonnet-20240229)
  • --platform: API platform (openai, anthropic, deepinfra, togetherai, gemini, openkey)
  • --questionnaire: Questionnaire to use (see Questionnaire List)
  • --control OR --emotion: Either run control condition OR specify emotion

Optional Arguments:

  • --repeat / -n: Number of repeated runs (default: 1)
  • --max-workers: Parallel workers for concurrent requests (default: 1)
  • --output-name: Custom output filename
  • --continuous: Resume from existing output file
  • --shuffle: Shuffle question order
  • --situation-source: Path to situation data (default: data/reformatted_situations.jsonl)

Example:

python emotionbench_run.py \
  --model gpt-4 \
  --platform openai \
  --questionnaire PANAS \
  --emotion Anxiety \
  --repeat 5 \
  --max-workers 3 \
  --shuffle

emotionbench_eval.py - Evaluate Responses

Computes average scores per situation from model responses.

Usage:

python emotionbench_eval.py <input_file> [-o <output_file>]

Example:

python emotionbench_eval.py results/gpt-4_PANAS_Anxiety.jsonl
# Creates: results/gpt-4_PANAS_Anxiety_eval.jsonl

emotionbench_analyze.py - Analyze Results

Compares control and target conditions with statistical tests.

Required Arguments:

  • --control-file: Path to control eval file
  • --target-file: Path to target eval file

Optional Arguments:

  • --questionnaire: Questionnaire name (default: PANAS)
  • --significance-level: Significance level for tests (default: 0.05)
  • --by-emotion: Include emotion-specific comparisons

Example:

python emotionbench_analyze.py \
  --control-file results/gpt-4_PANAS_control_eval.jsonl \
  --target-file results/gpt-4_PANAS_Anxiety_eval.jsonl \
  --questionnaire PANAS \
  --significance-level 0.01 \
  --by-emotion
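
For a quick sanity check outside the script, the control-vs-target comparison can be approximated in a few lines of Python. This is only a minimal sketch: it assumes each *_eval.jsonl line is a JSON object with a numeric "score" field (a hypothetical field name), and it uses Welch's t-test, which may not be the exact statistical test emotionbench_analyze.py applies.

# Minimal sketch of a control-vs-target comparison; not the script's exact implementation.
# Assumes each *_eval.jsonl line is a JSON object with a numeric "score" field
# (hypothetical field name; adjust to match your eval output).
import json
from scipy.stats import ttest_ind

def load_scores(path):
    with open(path) as f:
        return [json.loads(line)["score"] for line in f if line.strip()]

control = load_scores("results/gpt-4_PANAS_control_eval.jsonl")
target = load_scores("results/gpt-4_PANAS_Anxiety_eval.jsonl")

# Welch's two-sample t-test: does the emotional situation shift the scores?
stat, p_value = ttest_ind(target, control, equal_var=False)
print(f"control mean={sum(control)/len(control):.2f}, "
      f"target mean={sum(target)/len(target):.2f}, p={p_value:.4f}")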

🎭 Emotions and Situations

Supported Emotions:

  • Anger
  • Anxiety
  • Depression
  • Frustration
  • Jealousy
  • Guilt
  • Fear
  • Embarrassment

Each emotion has multiple factors (specific subtypes) with multiple situations (scenarios).

Data Format: Situations are stored in data/reformatted_situations.jsonl:

{
  "situation_id": "anger-0/scenario-0",
  "emotion": "Anger",
  "factor": "Facing Self-Opinioned People",
  "text": "When you discuss your opinions..."
}

To customize situations, modify data/reformatted_situations.jsonl or the source data in data/raw_situations.jsonl.
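
For example, the situation file can be inspected with a short Python snippet. This is a minimal sketch based only on the fields shown in the data format above:

# Minimal sketch: load the situation file and count factors/situations per emotion.
# Field names ("emotion", "factor") follow the data format shown above.
import json
from collections import defaultdict

factors = defaultdict(set)
counts = defaultdict(int)
with open("data/reformatted_situations.jsonl") as f:
    for line in f:
        record = json.loads(line)
        factors[record["emotion"]].add(record["factor"])
        counts[record["emotion"]] += 1

for emotion in sorted(factors):
    print(f"{emotion}: {len(factors[emotion])} factors, {counts[emotion]} situations")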

📃 Questionnaire List

Questionnaire                               | Code    | Recommended Emotion
Positive And Negative Affect Schedule       | PANAS   | ALL
Aggression Questionnaire                    | AGQ     | Anger
Short-form Depression Anxiety Stress Scales | DASS-21 | Anxiety
Beck Depression Inventory                   | BDI     | Depression
Frustration Discomfort Scale                | FDS     | Frustration
Multidimensional Jealousy Scale             | MJS     | Jealousy
Guilt And Shame Proneness                   | GASP    | Guilt
Fear Survey Schedule                        | FSS     | Fear
Brief Fear of Negative Evaluation           | BFNE    | Embarrassment

Questionnaire data is stored in data/questionnaires.json.
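
To browse the questionnaire definitions programmatically, a minimal sketch (assuming only that the file parses as standard JSON; its internal layout is not documented here):

# Minimal sketch for inspecting the questionnaire definitions.
# The exact layout of data/questionnaires.json is not documented here;
# this only assumes the file parses as standard JSON.
import json

with open("data/questionnaires.json") as f:
    questionnaires = json.load(f)

# Print the top-level structure to see how PANAS, AGQ, etc. are keyed.
if isinstance(questionnaires, dict):
    print(list(questionnaires.keys()))
else:
    print(type(questionnaires), len(questionnaires))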

🚀 Benchmarking Your Own Model

Using Supported Platforms

EmotionBench supports multiple API platforms out of the box:

  • OpenAI: GPT models (--platform openai)
  • Anthropic: Claude models (--platform anthropic)
  • Google: Gemini models (--platform gemini)
  • DeepInfra: Various open models (--platform deepinfra)
  • TogetherAI: Various open models (--platform togetherai)
  • OpenKey: Custom OpenAI-compatible endpoints (--platform openkey)

Set your API keys as environment variables:

export OPENAI_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"
export GOOGLE_API_KEY="your-key"
# ... etc

Adding a New Platform

To integrate a new model platform:

  1. Create a new class in llm/ that inherits from LLMChat:
# llm/your_platform.py
from llm.base import LLMChat
from llm.format import Message
from typing import List

class YourPlatformChat(LLMChat):
    def __init__(self, model_name: str, **kwargs):
        super().__init__(model_name, **kwargs)
        # Initialize your API client

    def chat(self, messages: List[Message]) -> List[str]:
        # Implement chat logic
        # Return list of response strings
        pass
  2. Register your platform in llm/__init__.py:
from llm.your_platform import YourPlatformChat

def get_platform(platform: str) -> LLMChat:
    # ... existing code ...
    elif platform == "yourplatform":
        return YourPlatformChat
    # ...
  3. Update platform choices in emotionbench_run.py:
parser.add_argument("--platform", required=True, type=str,
    choices=["openai", "anthropic", ..., "yourplatform"],
    # ...
)

See llm/openai_api.py or llm/anthropic_api.py for reference implementations.
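
As a more concrete illustration, below is a hedged sketch of an adapter for a generic OpenAI-compatible endpoint. The environment variable names (MY_PLATFORM_API_KEY, MY_PLATFORM_BASE_URL) are hypothetical, and the Message and LLMChat attribute names are assumptions; check llm/base.py and llm/format.py for the actual interfaces.

# llm/my_compatible_api.py -- hedged sketch of a platform adapter for an
# OpenAI-compatible endpoint, following the template above. Assumes the
# openai>=1.0 client is installed and that llm.format.Message exposes
# .role and .content attributes (verify against llm/format.py).
import os
from typing import List

from openai import OpenAI

from llm.base import LLMChat
from llm.format import Message


class MyCompatibleChat(LLMChat):
    def __init__(self, model_name: str, **kwargs):
        super().__init__(model_name, **kwargs)
        # Hypothetical environment variables for the endpoint and key.
        self.client = OpenAI(
            api_key=os.environ["MY_PLATFORM_API_KEY"],
            base_url=os.environ.get("MY_PLATFORM_BASE_URL", "https://api.example.com/v1"),
        )

    def chat(self, messages: List[Message]) -> List[str]:
        # Convert the framework's Message objects into OpenAI-style dicts.
        payload = [{"role": m.role, "content": m.content} for m in messages]
        # self.model_name is assumed to be set by LLMChat.__init__; verify in llm/base.py.
        response = self.client.chat.completions.create(
            model=self.model_name, messages=payload
        )
        return [choice.message.content for choice in response.choices]

After implementing the class, register it and add it to the --platform choices as described in steps 2 and 3 above.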

👉 Paper and Citation

For more details, please refer to our paper here.

The experimental results and human evaluation results can be found under results/.


If you find our paper and tool interesting and useful, please feel free to give us a star and cite us:

@article{huang2024apathetic,
  title={Apathetic or Empathetic? Evaluating LLMs' Emotional Alignments with Humans},
  author={Huang, Jen-tse and Lam, Man Ho and Li, Eric John and Ren, Shujie and Wang, Wenxuan and Jiao, Wenxiang and Tu, Zhaopeng and Lyu, Michael R},
  journal={Advances in Neural Information Processing Systems},
  volume={37},
  pages={97053--97087},
  year={2024}
}
