LongEmotion

Measuring Emotional Intelligence of Large Language Models in Long-Context Interaction

1Shenzhen MSU-BIT University, 2The University of Hong Kong, 3City University of Hong Kong,
4Institute of Automation, Chinese Academy of Sciences, 5Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences,
6Beijing Normal University, 7The University of California, Los Angeles, 8The Ohio State University, 9University of Michigan
* These authors contributed equally.
† Corresponding authors.
An illustrative overview of the LongEmotion dataset: Emotion Classification, Emotion Detection, Emotion QA, Emotion Conversation, Emotion Summary, and Emotion Expression.

Introduction

We present LongEmotion, a benchmark specifically designed for long-context emotional intelligence (EI) tasks. It covers a diverse set of tasks: Emotion Classification, Emotion Detection, Emotion QA, Emotion Conversation, Emotion Summary, and Emotion Expression. On average, the input length for these tasks reaches 8,777 tokens, and Emotion Expression additionally requires long-form generation. To enhance performance under realistic constraints, we incorporate Retrieval-Augmented Generation (RAG) and Collaborative Emotional Modeling (CoEM), and compare them with standard prompt-based methods. Experimental results show that both RAG and CoEM consistently enhance EI-related performance across most long-context tasks, advancing LLMs toward more practical, real-world EI applications.

LongEmotion Dataset

Distribution of the LongEmotion dataset.

Left: Token distributions across tasks. For Emotion Expression, the sequence length refers to the average length of model-generated outputs, whereas for the other tasks, it corresponds to the average length of input contexts.
Right: Distribution of sample counts across the six tasks, illustrating the overall composition of the dataset.

Statistical overview of the LongEmotion dataset.

ID gives the abbreviation used for each task. EC, ED, QA, MC, and ES are long-text input tasks, where Avg len refers to the average context length of the entries; EE is a long-text generation task, where Avg len indicates the models' average generation length. LLM as Judge indicates scoring performed by GPT-4o.

Emotion Classification: This task requires the model to identify the emotional category of a target entity within long-context texts that contain lengthy spans of context-independent noise.
Emotion Detection: The model is given N+1 emotional segments. Among them, N segments express the same emotion, while one segment expresses a unique emotion. The model is required to identify the single distinctive emotional segment.
Emotion QA: In this task, the model is required to answer questions grounded in long-context psychological literature. Model performance is evaluated using the F1 score between its responses and the ground-truth answers (a token-level F1 sketch follows this list).
Emotion Conversation: The model is required to act as a psychological counselor and provide empathetic and contextually appropriate emotional support to the patient.
Emotion Summary: In this task, the model is required to summarize the following aspects from long-context psychological pathology reports: (i) causes, (ii) symptoms, (iii) treatment process, (iv) illness characteristics, and (v) treatment effects.
Emotion Expression: In this task, the model is situated within a specific emotional context and prompted to produce a long-form emotional self-narrative.
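For Emotion QA, the benchmark scores the model's response against the ground-truth answer with F1. The snippet below is a minimal sketch assuming a SQuAD-style token-level F1 (lowercasing, whitespace tokenization, bag-of-words overlap); the benchmark's exact text normalization may differ.

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a model response and a reference answer.

    Minimal sketch: lowercase, whitespace-tokenize, and score the
    bag-of-words overlap (SQuAD-style). The benchmark's exact
    normalization may differ from this.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```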

Collaborative Emotional Modeling

To address EI tasks involving long contexts, we propose a hybrid retrieval-generation architecture that combines Retrieval-Augmented Generation (RAG) with modular multi-agent collaboration.

The pipeline of Collaborative Emotional Modeling (CoEM).

Chunking: The input context is segmented into semantically coherent or token-length-constrained chunks. This enables efficient retrieval and minimizes irrelevant content during similarity estimation.
Initial Ranking: A retrieval agent, implemented as CoEM-Rank, evaluates the relevance of each chunk to the query using dense semantic similarity. Top-ranked chunks are passed forward for enhancement.
Multi-Agent Enrichment: A reasoning agent called CoEM-Sage, functioning as a knowledge assistant, enriches the selected chunks by incorporating external knowledge or latent emotional signals. These signals, derived from psychological theories or curated priors, enhance emotional reasoning without introducing task-specific leakage.
Re-Ranking: The enriched chunks are then re-evaluated by CoEM-Rank for both semantic relevance and emotional alignment. This ensures that the final input is both factually grounded and affectively coherent.
Emotional Ensemble Generation: The selected and enriched content, along with the prompt, is fed into a generation model denoted as CoEM-Core. This model (e.g., a long-context LLM or an instruction-tuned model) produces the final task-specific output, whether classification, summarization, or dialogue generation. A minimal code sketch of the full pipeline follows this list.
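Below is a minimal sketch of how these five stages could be wired together. The agent names (CoEM-Rank, CoEM-Sage, CoEM-Core) follow the description above, but the chunk size, similarity measure, and prompt wording are illustrative assumptions rather than the reference implementation.

```python
from typing import Callable, List

def chunk(context: str, max_tokens: int = 512) -> List[str]:
    """Stage 1 -- Chunking: split the long context into token-length-
    constrained pieces (word count is used here as a rough token proxy)."""
    words = context.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def coem_pipeline(
    context: str,
    query: str,
    embed: Callable[[str], List[float]],  # dense encoder used by CoEM-Rank
    sage: Callable[[str], str],           # CoEM-Sage: knowledge/emotion enrichment model
    core: Callable[[str], str],           # CoEM-Core: final generation model
    top_k: int = 4,
) -> str:
    def score(text: str) -> float:
        # Cosine similarity between the query embedding and a chunk embedding.
        q, t = embed(query), embed(text)
        dot = sum(a * b for a, b in zip(q, t))
        norm = (sum(a * a for a in q) ** 0.5) * (sum(b * b for b in t) ** 0.5)
        return dot / norm if norm else 0.0

    # Stage 2 -- Initial Ranking (CoEM-Rank): keep the top-k most relevant chunks.
    ranked = sorted(chunk(context), key=score, reverse=True)[:top_k]

    # Stage 3 -- Multi-Agent Enrichment (CoEM-Sage): add external knowledge
    # or latent emotional signals to each selected chunk.
    enriched = [
        sage("Enrich this passage with relevant psychological knowledge "
             "and emotional cues:\n" + c)
        for c in ranked
    ]

    # Stage 4 -- Re-Ranking (CoEM-Rank): re-score the enriched chunks for
    # semantic relevance and emotional alignment with the query.
    reranked = sorted(enriched, key=score, reverse=True)

    # Stage 5 -- Emotional Ensemble Generation (CoEM-Core): produce the
    # task-specific output from the prompt plus the selected evidence.
    evidence = "\n\n".join(reranked)
    return core(query + "\n\nRelevant context:\n" + evidence)
```

In practice, embed, sage, and core would be backed by an embedding model and the LLMs reported in the experiments; they are passed in as plain callables here only to keep the sketch self-contained.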

Experiment Results

Data

You can download our data directly from Hugging Face Datasets. For guidance on how to access and use the data, please consult the instructions in our GitHub repository.
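For orientation, loading the data with the datasets library might look like the sketch below; the repository ID is a placeholder, so use the exact identifier given in the GitHub instructions.

```python
from datasets import load_dataset

# Placeholder repository ID -- substitute the actual LongEmotion dataset
# path listed in the project's GitHub instructions.
dataset = load_dataset("LongEmotion/LongEmotion")
print(dataset)
```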

Experiment Results

We evaluate a range of closed- and open-source models under the Base, RAG, and CoEM settings. For the Emotion Classification, Emotion Detection, Emotion QA, and Emotion Expression tasks, we employ GPT-4o as the CoEM-Sage, while DeepSeek-V3 serves in that role for the Emotion Conversation (MC-4) and Emotion Summary tasks. For tasks with automatic evaluation, we adopt GPT-4o as the evaluator. The results are shown in the following table.

Experiment results across different prompting settings (Base, RAG, CoEM). EC represents Emotion Classification, ED represents Emotion Detection, QA represents Emotion QA, MC-4 represents the fourth stage of Emotion Conversation, ES represents Emotion Summary, and EE represents Emotion Expression.

Method  Model                 EC     ED     QA     MC-4  ES    EE     Overall
Base    GPT-4o-mini           28.50  16.42  48.61  3.75  4.14  86.77  56.90
Base    GPT-4o                51.17  19.12  50.12  3.77  4.19  81.03  60.37
Base    DeepSeek-V3           44.00  24.51  45.53  3.99  4.28  81.75  60.45
Base    Qwen3-8B              38.50  18.14  44.75  3.97  4.21  73.40  56.97
Base    Llama3.1-8B-Instruct  26.17   9.80  45.74  4.00  3.98  75.61  53.83
RAG     GPT-4o-mini           38.33  21.57  50.72  3.78  4.19  80.41  58.88
RAG     GPT-4o                54.67  22.55  51.81  3.80  4.13  79.49  61.46
RAG     DeepSeek-V3           52.17  23.53  50.44  4.34  4.28  81.83  63.92
RAG     Qwen3-8B              39.67  19.12  44.34  4.14  4.20  73.28  57.84
RAG     Llama3.1-8B-Instruct  28.00  11.27  47.04  3.94  3.71  75.16  53.46
CoEM    GPT-4o-mini           48.00  20.59  47.51  3.77  3.91  80.38  58.66
CoEM    GPT-4o                52.83  25.00  47.24  3.81  4.02  80.41  60.48
CoEM    DeepSeek-V3           54.33  23.04  46.52  4.34  4.12  82.83  63.05
CoEM    Qwen3-8B              52.83  18.14  46.31  4.14  4.09  73.59  59.78
CoEM    Llama3.1-8B-Instruct  38.17  11.27  44.79  4.00  3.60  75.71  54.53

Real-time Leaderboard

Below is the real-time leaderboard based on submitted results. The leaderboard is sorted by overall score, a weighted combination of the six task scores (see the rules below). All results shown here use the Base setting. The leaderboard is updated continuously, and we welcome new submissions!

Overall Score Calculation Rules:
  • EC (Emotion Classification): 15% weight
  • ED (Emotion Detection): 15% weight
  • QA (Emotion QA): 20% weight
  • MC-4 (Emotion Conversation): 20% weight × 20 (score conversion)
  • ES (Emotion Summary): 15% weight × 20 (score conversion)
  • EE (Emotion Expression): 15% weight

Formula: Overall Score = (EC×0.15) + (ED×0.15) + (QA×0.20) + (MC-4×20×0.20) + (ES×20×0.15) + (EE×0.15)

Note: MC-4 and ES are displayed as 5-point scores but converted to 100-point scale in the calculation.
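As a quick sanity check, the rule above can be expressed directly in code. Applied to the Base GPT-4o-mini row of the results table (EC 28.50, ED 16.42, QA 48.61, MC-4 3.75, ES 4.14, EE 86.77), it reproduces the reported overall score of 56.90.

```python
def overall_score(ec: float, ed: float, qa: float,
                  mc4: float, es: float, ee: float) -> float:
    """Weighted leaderboard score. MC-4 and ES are 5-point scores and are
    converted to a 100-point scale (x 20) before weighting."""
    return (ec * 0.15 + ed * 0.15 + qa * 0.20
            + mc4 * 20 * 0.20 + es * 20 * 0.15 + ee * 0.15)

# Base GPT-4o-mini row from the results table -> approximately 56.90,
# matching the reported overall score.
print(overall_score(28.50, 16.42, 48.61, 3.75, 4.14, 86.77))
```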

Comparison of GPT series models

GPT-5's overall performance surpasses that of GPT-4o and GPT-4o-mini. For Emotion Classification and Emotion Detection, we prompt the models to output only the final label; on these tasks, GPT-5's reasoning ability is markedly stronger than that of GPT-4o and GPT-4o-mini.

Because Emotion Classification and Emotion Detection only ask the models for a final label, we do not include case studies for these tasks. The remaining four tasks (Emotion Expression, Emotion Conversation, Emotion Summary, and Emotion QA) require free-form generation, so for these we conducted detailed case-by-case analyses to evaluate the models' concrete performance on EI tasks beyond their scores.

From the tasks above, we conclude that GPT-4o-mini behaves more like a human, with richer emotional expression, but its application of psychological theory is somewhat lacking. GPT-5, on the other hand, shows a better grasp of psychological theory, yet its output is rigid and mechanical, which may make the user experience feel less natural in practice. GPT-4o strikes the best balance between theoretical grounding and emotional expressiveness.

Comprehensive Prompt Collections

This section presents the complete set of prompts used throughout the framework, covering the Evaluation, Multi-Agent Enrichment, and Emotional Ensemble Generation stages across all tasks. For tasks that adopt automatic evaluation as the metric, we use GPT-4o as the evaluation model. During the Multi-Agent Enrichment stage, task-specific prompts guide agent collaboration and reasoning. Finally, in the Emotional Ensemble Generation stage, carefully constructed prompts support emotional diversity and coherence in response generation.

Evaluation Prompt Examples

Multi-Agent Enrichment Prompt Examples

Emotional Ensemble Generation Prompt Examples