LongEmotion

Measuring Emotional Intelligence of Large Language Models in Long-Context Interaction

1Shenzhen MSU-BIT University, 2The University of Hong Kong, 3City University of Hong Kong,
4Institute of Automation, Chinese Academy of Sciences, 5Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences,
6Beijing Normal University, 7The University of California, Los Angeles, 8The Ohio State University, 9University of Michigan
* These authors contributed equally.
† Corresponding authors.
An illustrative overview of the LongEmotion dataset: Emotion Classification, Emotion Detection, Emotion QA, Emotion Conversation, Emotion Summary, and Emotion Expression.

Introduction

We present LongEmotion, a benchmark specifically designed for long-context emotional intelligence (EI) tasks. It covers a diverse set of tasks: Emotion Classification, Emotion Detection, Emotion QA, Emotion Conversation, Emotion Summary, and Emotion Expression. On average, the input length for these tasks reaches 8,777 tokens, and Emotion Expression additionally requires long-form generation. To enhance performance under realistic constraints, we incorporate Retrieval-Augmented Generation (RAG) and Collaborative Emotional Modeling (CoEM), and compare them with standard prompt-based methods. Experimental results show that both RAG and CoEM consistently enhance EI-related performance across most long-context tasks, advancing LLMs toward more practical, real-world EI applications.

LongEmotion Dataset

Distribution of the LongEmotion dataset.

Left: Token distributions across tasks. For Emotion Expression, the sequence length refers to the average length of model-generated outputs, whereas for the other tasks, it corresponds to the average length of input contexts.
Right: Distribution of sample counts across the six tasks, illustrating the overall composition of the dataset.

Statistical overview of the LongEmotion dataset.

ID gives the abbreviation used for each task. EC, ED, QA, MC, and ES are long-text input tasks, where Avg len refers to the average context length of the entries; EE is a long-text generation task, where Avg len indicates the models' average generation length. LLM as Judge indicates scoring performed by GPT-4o.

Emotion Classification: This task requires the model to identify the emotional category of a target entity within long-context texts that contain lengthy spans of context-independent noise.
Emotion Detection: The model is given N+1 emotional segments. Among them, N segments express the same emotion, while one segment expresses a unique emotion. The model is required to identify the single distinctive emotional segment.
Emotion QA: In this task, the model is required to answer questions grounded in long-context psychological literature. Model performance is evaluated using the F1 score between its responses and the ground-truth answers (a token-level F1 sketch follows this list).
Emotion Conversation: The model is required to act as a psychological counselor and provide empathetic and contextually appropriate emotional support to the patient.
Emotion Summary: In this task, the model is required to summarize the following aspects from long-context psychological pathology reports: (i) causes, (ii) symptoms, (iii) treatment process, (iv) illness characteristics, and (v) treatment effects.
Emotion Expression: In this task, the model is situated within a specific emotional context and prompted to produce a long-form emotional self-narrative.
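For Emotion QA, the benchmark scores the model's response against the ground-truth answer with F1. The snippet below is a minimal sketch assuming a SQuAD-style token-level F1 (lowercasing, whitespace tokenization, bag-of-words overlap); the benchmark's exact text normalization may differ.

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a model response and a reference answer.

    Minimal sketch: lowercase, whitespace-tokenize, and score the
    bag-of-words overlap (SQuAD-style). The benchmark's exact
    normalization may differ from this.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```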

Collaborative Emotional Modeling

To address EI tasks involving long contexts, we propose a hybrid retrieval-generation architecture that combines Retrieval-Augmented Generation (RAG) with modular multi-agent collaboration.

The pipeline of Collaborative Emotional Modeling (CoEM).

Chunking: The input context is segmented into semantically coherent or token-length-constrained chunks. This enables efficient retrieval and minimizes irrelevant content during similarity estimation.
Initial Ranking: A retrieval agent, implemented as CoEM-Rank, evaluates the relevance of each chunk to the query using dense semantic similarity. Top-ranked chunks are passed forward for enhancement.
Multi-Agent Enrichment: A reasoning agent called CoEM-Sage, functioning as a knowledge assistant, enriches the selected chunks by incorporating external knowledge or latent emotional signals. These signals, derived from psychological theories or curated priors, enhance emotional reasoning without introducing task-specific leakage.
Re-Ranking: The enriched chunks are then re-evaluated by CoEM-Rank for both semantic relevance and emotional alignment. This ensures that the final input is both factually grounded and affectively coherent.
Emotional Ensemble Generation: The selected and enriched content, along with the prompt, is fed into a generation model denoted as CoEM-Core. This model (e.g., a long-context LLM or an instruction-tuned model) produces the final task-specific output, whether classification, summarization, or dialogue generation. A minimal code sketch of the full pipeline follows this list.
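Below is a minimal sketch of how these five stages could be wired together. The agent names (CoEM-Rank, CoEM-Sage, CoEM-Core) follow the description above, but the chunk size, similarity measure, and prompt wording are illustrative assumptions rather than the reference implementation.

```python
from typing import Callable, List

def chunk(context: str, max_tokens: int = 512) -> List[str]:
    """Stage 1 -- Chunking: split the long context into token-length-
    constrained pieces (word count is used here as a rough token proxy)."""
    words = context.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def coem_pipeline(
    context: str,
    query: str,
    embed: Callable[[str], List[float]],  # dense encoder used by CoEM-Rank
    sage: Callable[[str], str],           # CoEM-Sage: knowledge/emotion enrichment model
    core: Callable[[str], str],           # CoEM-Core: final generation model
    top_k: int = 4,
) -> str:
    def score(text: str) -> float:
        # Cosine similarity between the query embedding and a chunk embedding.
        q, t = embed(query), embed(text)
        dot = sum(a * b for a, b in zip(q, t))
        norm = (sum(a * a for a in q) ** 0.5) * (sum(b * b for b in t) ** 0.5)
        return dot / norm if norm else 0.0

    # Stage 2 -- Initial Ranking (CoEM-Rank): keep the top-k most relevant chunks.
    ranked = sorted(chunk(context), key=score, reverse=True)[:top_k]

    # Stage 3 -- Multi-Agent Enrichment (CoEM-Sage): add external knowledge
    # or latent emotional signals to each selected chunk.
    enriched = [
        sage("Enrich this passage with relevant psychological knowledge "
             "and emotional cues:\n" + c)
        for c in ranked
    ]

    # Stage 4 -- Re-Ranking (CoEM-Rank): re-score the enriched chunks for
    # semantic relevance and emotional alignment with the query.
    reranked = sorted(enriched, key=score, reverse=True)

    # Stage 5 -- Emotional Ensemble Generation (CoEM-Core): produce the
    # task-specific output from the prompt plus the selected evidence.
    evidence = "\n\n".join(reranked)
    return core(query + "\n\nRelevant context:\n" + evidence)
```

In practice, embed, sage, and core would be backed by an embedding model and the LLMs reported in the experiments; they are passed in as plain callables here only to keep the sketch self-contained.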

Experiment Results

Data

You can download our data directly from Hugging Face Datasets. For guidance on how to access and use the data, please consult the instructions in our GitHub repository.
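For orientation, loading the data with the datasets library might look like the sketch below; the repository ID is a placeholder, so use the exact identifier given in the GitHub instructions.

```python
from datasets import load_dataset

# Placeholder repository ID -- substitute the actual LongEmotion dataset
# path listed in the project's GitHub instructions.
dataset = load_dataset("LongEmotion/LongEmotion")
print(dataset)
```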

Experiment Results

We evaluate a range of closed- and open-source models under the Base, RAG, and CoEM settings. For the Emotion Classification, Emotion Detection, Emotion QA, and Emotion Expression tasks, we employ GPT-4o as the CoEM-Sage, while DeepSeek-V3 serves in that role for the Emotion Conversation (MC-4) and Emotion Summary tasks. For tasks with automatic evaluation, we adopt GPT-4o as the evaluator. The results are shown in the following table.

Experiment results across different prompting settings (Base, RAG, CoEM). EC represents Emotion Classification, ED represents Emotion Detection, QA represents Emotion QA, MC-4 represents the fourth stage of Emotion Conversation, ES represents Emotion Summary, and EE represents Emotion Expression.

Method  Model                 EC     ED     QA     MC-4  ES    EE     Overall
Base    GPT-4o-mini           28.50  16.42  48.61  3.75  4.14  86.77  56.90
Base    GPT-4o                51.17  19.12  50.12  3.77  4.19  81.03  60.37
Base    DeepSeek-V3           44.00  24.51  45.53  3.99  4.28  81.75  60.45
Base    Qwen3-8B              38.50  18.14  44.75  3.97  4.21  73.40  56.97
Base    Llama3.1-8B-Instruct  26.17   9.80  45.74  4.00  3.98  75.61  53.83
RAG     GPT-4o-mini           38.33  21.57  50.72  3.78  4.19  80.41  58.88
RAG     GPT-4o                54.67  22.55  51.81  3.80  4.13  79.49  61.46
RAG     DeepSeek-V3           52.17  23.53  50.44  4.34  4.28  81.83  63.92
RAG     Qwen3-8B              39.67  19.12  44.34  4.14  4.20  73.28  57.84
RAG     Llama3.1-8B-Instruct  28.00  11.27  47.04  3.94  3.71  75.16  53.46
CoEM    GPT-4o-mini           48.00  20.59  47.51  3.77  3.91  80.38  58.66
CoEM    GPT-4o                52.83  25.00  47.24  3.81  4.02  80.41  60.48
CoEM    DeepSeek-V3           54.33  23.04  46.52  4.34  4.12  82.83  63.05
CoEM    Qwen3-8B              52.83  18.14  46.31  4.14  4.09  73.59  59.78
CoEM    Llama3.1-8B-Instruct  38.17  11.27  44.79  4.00  3.60  75.71  54.53

Real-time Leaderboard

Below is the real-time leaderboard based on submitted results. The leaderboard is sorted by overall score, a weighted combination of the six task scores (see the rules below). All results shown here use the Base setting. The leaderboard is updated continuously, and we welcome new submissions!

Overall Score Calculation Rules:
  • EC (Emotion Classification): 15% weight
  • ED (Emotion Detection): 15% weight
  • QA (Emotion QA): 20% weight
  • MC-4 (Emotion Conversation): 20% weight × 20 (score conversion)
  • ES (Emotion Summary): 15% weight × 20 (score conversion)
  • EE (Emotion Expression): 15% weight

Formula: Overall Score = (EC×0.15) + (ED×0.15) + (QA×0.20) + (MC-4×20×0.20) + (ES×20×0.15) + (EE×0.15)

Note: MC-4 and ES are displayed as 5-point scores but converted to 100-point scale in the calculation.
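As a quick sanity check, the rule above can be expressed directly in code. Applied to the Base GPT-4o-mini row of the results table (EC 28.50, ED 16.42, QA 48.61, MC-4 3.75, ES 4.14, EE 86.77), it reproduces the reported overall score of 56.90.

```python
def overall_score(ec: float, ed: float, qa: float,
                  mc4: float, es: float, ee: float) -> float:
    """Weighted leaderboard score. MC-4 and ES are 5-point scores and are
    converted to a 100-point scale (x 20) before weighting."""
    return (ec * 0.15 + ed * 0.15 + qa * 0.20
            + mc4 * 20 * 0.20 + es * 20 * 0.15 + ee * 0.15)

# Base GPT-4o-mini row from the results table -> approximately 56.90,
# matching the reported overall score.
print(overall_score(28.50, 16.42, 48.61, 3.75, 4.14, 86.77))
```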

Comparison of GPT series models

GPT-5's overall performance surpasses that of GPT-4o and GPT-4o-mini. For Emotion Classification and Emotion Detection, we prompt the models to output only the final label; on these tasks, GPT-5's reasoning ability is markedly stronger than that of GPT-4o and GPT-4o-mini.

Because Emotion Classification and Emotion Detection only ask the models for a final label, we do not include case studies for these tasks. The remaining four tasks (Emotion Expression, Emotion Conversation, Emotion Summary, and Emotion QA) require free-form generation, so for these we conducted detailed case-by-case analyses to evaluate the models' concrete performance on EI tasks beyond their scores.

From the tasks above, we conclude that GPT-4o-mini behaves more like a human, with richer emotional expression, but its application of psychological theory is somewhat lacking. GPT-5, on the other hand, shows a better grasp of psychological theory, yet its output is rigid and mechanical, which may make the user experience feel less natural in practice. GPT-4o strikes the best balance between theoretical grounding and emotional expressiveness.

Comprehensive Prompt Collections

This section presents the complete set of prompts used throughout the framework, covering the Evaluation, Multi-Agent Enrichment, and Emotional Ensemble Generation stages across all tasks. For tasks that adopt automatic evaluation as the metric, we use GPT-4o as the evaluation model. During the Multi-Agent Enrichment stage, task-specific prompts guide agent collaboration and reasoning. Finally, in the Emotional Ensemble Generation stage, carefully constructed prompts support emotional diversity and coherence in response generation.

Evaluation Prompt Examples

Multi-Agent Enrichment Prompt Examples

Emotional Ensemble Generation Prompt Examples