SEADialogues is a new benchmark dataset designed to address the lack of cultural nuance in existing dialogue systems. It features 32,000 multi-turn, persona-rich conversations across 8 Southeast Asian languages. By integrating manually curated cultural knowledge directly into the generation pipeline, SEADialogues provides a high-quality resource for building and evaluating more inclusive, culturally aware, and human-like conversational AI.
Current dialogue systems and datasets often overlook the cultural nuances that are essential for natural human conversation. Many large language models (LLMs) struggle to reflect cultural values accurately or generate contextually relevant, culture-specific references. This is a significant gap, especially for culturally diverse regions like Southeast Asia, which is home to over 700 million people and hundreds of languages.
SEADialogues was created to address these challenges by providing a dataset that is:
- Culturally Grounded: Dialogues are built on a foundation of manually curated cultural knowledge, including local entities, foods, and communication norms relevant to Southeast Asia.
- Persona-Rich: The dataset includes 210 diverse personas to facilitate the generation of personalized and engaging conversations.
- Multilingual: It covers 8 languages from 6 Southeast Asian countries, many of which are low-resource: Indonesian, Javanese, Minangkabau, Thai, Malay, Vietnamese, Tamil, and Tagalog.
- Multi-Turn & Multi-Topic: Conversations are designed to be dynamic, often shifting between two distinct topics to better mimic real-world interactions.
A comparison showing how SEADialogues differs from existing dialogue datasets, highlighting its unique focus on cultural relevance.
SEADialogues is the first dataset of its kind to explicitly represent cultural aspects within each conversation.
| Feature | Count |
|---|---|
| Languages | 8 |
| Dialogues | 32,000 |
| Personas | 210 |
| Topics | 100 |
| Scenarios | 300 |
| Avg. Utterances per Dialogue | 13.86 |
| Avg. Words per Utterance | 21.69 |
The dataset was constructed using a meticulous four-stage pipeline to ensure high-quality, culturally appropriate dialogues.
- Template Generation: Abstract templates for scenarios and personas were created with delexicalized placeholders (e.g., `[COUNTRY]`, `[FOOD]`). Over 100 topics and 210 personas were used to generate scenarios.
- Lexicalization: The templates were populated with culturally relevant entities from a curated knowledge pool. This step ensures that entities are contextually and linguistically aligned with the target language and culture (e.g., pairing "Jakarta" with "Indonesia").
- Dialogue Generation: LLMs were prompted with the lexicalized templates, personas, and specific instructions to generate multi-turn dialogues that transition smoothly between topics.
- Annotation & Evaluation: Generated dialogues underwent a rigorous, two-fold evaluation process:
- Human Annotation: Native speakers assessed dialogues on fluency, coherence, engagingness, naturalness, and cultural relevance.
- Automatic Evaluation: LLM-as-a-judge methods, including G-Eval, M-Prometheus, and R3, were used to provide scalable, quantitative quality metrics.
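The lexicalization stage above boils down to substituting each delexicalized placeholder with a concrete entity. The sketch below is a minimal illustration of that idea; the template and entity values are invented examples, not the actual curated knowledge pool:

```python
import re

# A delexicalized scenario template, as produced in the template-generation stage.
template = "Two friends in [COUNTRY] debate whether [FOOD] is best eaten for breakfast."

# A toy entity pool; the real pool is manually curated and far larger.
entity_pool = {
    "[COUNTRY]": "Indonesia",
    "[FOOD]": "nasi goreng",
}

def lexicalize(template: str, pool: dict) -> str:
    """Replace every [PLACEHOLDER] with its entity from the pool, leaving unknowns intact."""
    return re.sub(r"\[[A-Z_0-9-]+\]", lambda m: pool.get(m.group(0), m.group(0)), template)

print(lexicalize(template, entity_pool))
# Two friends in Indonesia debate whether nasi goreng is best eaten for breakfast.
```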
This codebase is intended for reproducing our experiments. It consists of the following components:
- `codes/correlation_analysis`: Code for calculating the alignment between human and automatic evaluations.
- `codes/dialogue_generation`: Code for generating multi-turn dialogues.
- `codes/g_eval`: Code for automatic evaluation using the G-Eval method.
- `codes/reward_model`: Code for automatic evaluation using the M-Prometheus and R3 reward model methods.
- `data`: Contains the prerequisite data required to create prompts for dialogue generation.
- `output`: Contains the generated prompts used for dialogue generation.
- `results`: Contains the generated multi-turn dialogues, along with the results from our human evaluations, automatic evaluations, and alignment analysis.
This guide outlines the steps to generate your own dialogue dataset using this codebase.
The dialogue generation script dynamically builds conversation scenarios by combining information from several data files located in the /data directory. You will need to prepare the following files, ensuring they follow the specified format. Examples are provided below for clarity.
- Themes and Topics (`data_themes.csv`)
  - Purpose: This file establishes the high-level themes and the specific conversational topics related to them.
  - Structure: A CSV with two columns: `Themes` and `Related Topics`.
  - Example:

    ```csv
    Themes,Related Topics
    Arts & Entertainment,Favorite Tv Shows From Childhood
    Arts & Entertainment,Favorite Musicians Or Bands
    ```
- Subtopic / Scenario Templates (`data_topics_subtopics.csv`)
  - Purpose: This file provides the narrative structure or "script" for a conversation on a given topic. It uses placeholders (we call these delexicalized entities, like `[LANGUAGE]`) that will be filled in later.
  - Structure: A CSV with two columns: `Topics` and `Template Sub Topics`. Each line under a topic builds the conversational scenario.
  - Example:

    ```csv
    Topics,Template Sub Topics
    Favorite TV Shows from Childhood,Two people discuss the influence of [LANGUAGE] folklore in their favorite childhood TV shows.
    ,"Person A loved a popular [LANGUAGE] [TV_SHOWS-1], while Person B grew up watching [LANGUAGE] [TV_SHOWS-2] on TV."
    ```
- Entities (`data_entities.csv`)
  - Purpose: This file defines the mapping between a placeholder (a `delexicalized_ent`) and its possible real-world values (its `lexicalized_ent`). The script uses this to replace placeholders like `[LANGUAGE]` with actual languages like "Thai" or "Indonesian".
  - Structure: A CSV with two columns: `delexicalized_ent` and `lexicalized_ent`.
  - Example:

    ```csv
    delexicalized_ent,lexicalized_ent
    [LANGUAGE],"Thai-tha
    Indonesian-ind
    Javanese-jav"
    ```
- Personas (`data_personas.csv`)
  - Purpose: This file defines character archetypes. It links a theme and topic to a personality template, which also uses placeholders to create varied and specific personas.
  - Structure: A CSV with columns like `Themes`, `Related Topics`, and `Template Personas`.
  - Example:

    ```csv
    Themes,Related Topics,Template Personas
    Arts & Entertainment,Favorite Tv Shows From Childhood,"A person fascinated by traditional [MOVIE_TYPE] and mythological characters: [MYTH_CHARACTER]"
    ```
- Personalities (`data_personality.csv`)
  - Purpose: A simple list of personality traits that can be assigned to the dialogue agents to influence their tone and style.
  - Structure: A CSV with a single `Personality` column.
  - Example:

    ```csv
    Personality
    Active
    Creative
    Honest
    ```
- Names (`data_names.csv`)
  - Purpose: Provides a pool of culturally specific first and last names, organized by gender and language/region, to assign to the dialogue personas.
  - Structure: A CSV with columns for `Gender` and language-specific names (e.g., `Thailand FirstName`, `Indonesia FirstName`).
  - Example:

    ```csv
    Gender,Thailand FirstName,Indonesia FirstName
    Male,Ananda (อนันดา),Andi
    ```
- High-Coupling Entity Map (`high_coupling_entities.json`)
  - Purpose: This JSON file is crucial for logical consistency. It defines relationships between two entities that are tightly linked (or "highly coupled"). For example, a specific film title should be associated with a correct genre. This prevents the script from generating illogical combinations, like describing the action film "The Raid" as a "romance".
  - Structure: A JSON object where `entity1` and `entity2` name the linked placeholders. The file then lists the valid `entity2` values for each specific `entity1` value.
  - Example: The file links `[FILM]` and `[MOVIE_TYPE]`. The entry for `"The_Raid-ind"` specifies that its valid movie types are `"martial_arts_epic-gen"` and `"action-gen"`.

    ```json
    {
      "entity1": "[FILM]",
      "entity2": "[MOVIE_TYPE]",
      "The_Raid-ind": [
        "martial_arts_epic-gen",
        "action-gen"
      ],
      "Laskar_Pelangi-ind": [
        "adventure-gen",
        "historical-drama-gen"
      ]
    }
    ```
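For illustration, a consistency check against this map could look like the following. This is a hypothetical sketch (the `is_valid_pair` helper is ours, not part of the codebase), using the example map from above inlined so the snippet is self-contained:

```python
import json

# The example high-coupling map from above, inlined for a self-contained demo.
coupling = json.loads("""
{
  "entity1": "[FILM]",
  "entity2": "[MOVIE_TYPE]",
  "The_Raid-ind": ["martial_arts_epic-gen", "action-gen"],
  "Laskar_Pelangi-ind": ["adventure-gen", "historical-drama-gen"]
}
""")

def is_valid_pair(film: str, movie_type: str, coupling: dict) -> bool:
    """Return True if the (film, movie_type) combination is allowed by the map."""
    return movie_type in coupling.get(film, [])

print(is_valid_pair("The_Raid-ind", "action-gen", coupling))   # True
print(is_valid_pair("The_Raid-ind", "romance-gen", coupling))  # False: no romantic "The Raid"
```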
To get started, you can modify the provided example files in the /data directory or create your own following these structures.
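Before running the pipeline, it can help to sanity-check that your CSVs parse with the columns the scripts expect. A minimal sketch using the standard `csv` module (the inline sample stands in for `data/data_themes.csv`; when working with the real files, pass an open file handle instead):

```python
import csv
import io

# Inline sample standing in for data/data_themes.csv.
sample = """Themes,Related Topics
Arts & Entertainment,Favorite Tv Shows From Childhood
Arts & Entertainment,Favorite Musicians Or Bands
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Verify the expected column headers are present.
assert set(rows[0].keys()) == {"Themes", "Related Topics"}
print(rows[0]["Related Topics"])  # Favorite Tv Shows From Childhood
```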
This step uses a series of scripts to convert the raw data from Step 1 into fully formed prompts that can be fed to a Large Language Model (LLM). The process involves "lexicalizing" the templates—that is, replacing the placeholders (e.g., [LANGUAGE]) with concrete data (e.g., "Indonesian").
Follow these sub-steps in order:
1. Generate Lexicalized Scenarios
   - Purpose: To fill in the scenario templates with specific entities.
   - Action: Run the `subtopic_generation.py` script. This will take the templates from `data_topics_subtopics.csv` and insert the corresponding lexicalized entities from `data_entities.csv` and other data files.
2. Generate Lexicalized Personas
   - Purpose: To create specific character personas by filling in the persona templates.
   - Action: Run the `persona_generation.py` script. Similar to the previous step, this replaces placeholders in `data_personas.csv` with concrete data.
3. Select Topics with a Topic Transition
   - Purpose: To select two distinct topics for each dialogue. This creates a more challenging and realistic conversation where the speakers must transition between different subjects.
   - Action: Run the `topic_clustering.py` script. This code groups similar topics together. By selecting topics from different clusters, you can ensure the dialogue will feature a clear "topic shift."
4. Assemble the Final Prompt
   - Purpose: To combine the generated scenarios, personas, and topic pairs into a final, structured prompt.
   - Action: Use the utility functions within `utils.py`. This script takes the outputs from the previous steps and assembles them into the final format required by the LLM. You will need to provide the necessary inputs as specified within the script's functions.
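Conceptually, the final assembly step combines the pieces into one generation prompt. The sketch below is a simplified illustration only; the actual prompt format lives in `utils.py` and will differ, and the personas and scenarios here are invented examples:

```python
def assemble_prompt(language: str, personas: list[str], scenarios: list[str]) -> str:
    """Combine lexicalized personas and two topic scenarios into one generation prompt."""
    persona_block = "\n".join(f"- Person {chr(65 + i)}: {p}" for i, p in enumerate(personas))
    scenario_block = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(scenarios))
    return (
        f"Generate a multi-turn dialogue in {language}.\n"
        f"Personas:\n{persona_block}\n"
        f"The conversation should start with topic 1 and naturally shift to topic 2:\n"
        f"{scenario_block}"
    )

prompt = assemble_prompt(
    "Indonesian",
    ["A retired teacher who loves classic TV", "A film student into martial-arts movies"],
    ["Two people discuss childhood TV shows.", "They compare favorite Indonesian films."],
)
print(prompt)
```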
With your final prompts ready, you can now generate the dialogues using your preferred LLM. We have included example scripts to call various popular models.
- For Google Gemini models: use the script `codes/dialogue_generation/call_gemini.py`.
- For OpenAI GPT models: use the script `codes/dialogue_generation/call_gpt.py`.
- For open-source models (via Hugging Face): use the script `codes/dialogue_generation/call_vllm.py`, which utilizes the vLLM library for fast and efficient inference with models from the Hugging Face Hub.
You can access the SEADialogues dataset from the Hugging Face Hub. Below is an example of how to load the data using the `datasets` library.
```python
from datasets import load_dataset

ds = load_dataset("SEACrowd/SEADialogues")
```

If you find our work helpful, please cite the SEADialogues paper:
```bibtex
@misc{kautsar2025seadialoguesmultilingualculturallygrounded,
  title={SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages},
  author={Muhammad Dehan Al Kautsar and Aswin Candra and Muhammad Alif Al Hakim and Maxalmina Satria Kahfi and Fajri Koto and Alham Fikri Aji and Peerat Limkonchotiwat and Ekapol Chuangsuwanich and Genta Indra Winata},
  year={2025},
  eprint={2508.07069},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.07069},
}
```

If you have any questions, feel free to open a GitHub Issue!


