SEADialogues: A Multilingual Culturally Grounded Dialogue Dataset


*Figure: The SEADialogues generation pipeline.*

SEADialogues is a new benchmark dataset designed to address the lack of cultural nuance in existing dialogue systems. It features 32,000 multi-turn, persona-rich conversations across 8 Southeast Asian languages. By integrating manually curated cultural knowledge directly into the generation pipeline, SEADialogues provides a high-quality resource for building and evaluating more inclusive, culturally aware, and human-like conversational AI.

🤔 Why SEADialogues?

Current dialogue systems and datasets often overlook the cultural nuances that are essential for natural human conversation. Many large language models (LLMs) struggle to reflect cultural values accurately or generate contextually relevant, culture-specific references. This is a significant gap, especially for culturally diverse regions like Southeast Asia, which is home to over 700 million people and hundreds of languages.

SEADialogues was created to address these challenges by providing a dataset that is:

  • Culturally Grounded: Dialogues are built on a foundation of manually curated cultural knowledge, including local entities, foods, and communication norms relevant to Southeast Asia.
  • Persona-Rich: The dataset includes 210 diverse personas to facilitate the generation of personalized and engaging conversations.
  • Multilingual: It covers 8 languages from 6 Southeast Asian countries, many of which are low-resource: Indonesian, Javanese, Minangkabau, Thai, Malay, Vietnamese, Tamil, and Tagalog.
  • Multi-Turn & Multi-Topic: Conversations are designed to be dynamic, often shifting between two distinct topics to better mimic real-world interactions.

*Table: A comparison showing how SEADialogues differs from existing dialogue datasets, highlighting its unique focus on cultural relevance.*

📊 Dataset Features

SEADialogues is the first dataset of its kind to explicitly represent cultural aspects within each conversation.

| Feature | Count |
| --- | --- |
| Languages | 8 |
| Dialogues | 32,000 |
| Personas | 210 |
| Topics | 100 |
| Scenarios | 300 |
| Avg. Utterances per Dialogue | 13.86 |
| Avg. Words per Utterance | 21.69 |

⚙️ Data Generation Pipeline

The dataset was constructed using a meticulous four-stage pipeline to ensure high-quality, culturally appropriate dialogues.

  1. Template Generation: Abstract templates for scenarios and personas were created with delexicalized placeholders (e.g., [COUNTRY], [FOOD]). Over 100 topics and 210 personas were used to generate scenarios.
  2. Lexicalization: The templates were populated with culturally relevant entities from a curated knowledge pool. This step ensures that entities are contextually and linguistically aligned with the target language and culture (e.g., pairing "Jakarta" with "Indonesia").
  3. Dialogue Generation: LLMs were prompted with the lexicalized templates, personas, and specific instructions to generate multi-turn dialogues that transition smoothly between topics.
  4. Annotation & Evaluation: Generated dialogues underwent a rigorous, two-fold evaluation process:
    • Human Annotation: Native speakers assessed dialogues on fluency, coherence, engagingness, naturalness, and cultural relevance.
    • Automatic Evaluation: LLM-as-a-judge methods, including G-Eval, M-Prometheus, and R3, were used to provide scalable, quantitative quality metrics.
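The lexicalization stage (step 2) amounts to substituting concrete, culturally aligned values into the delexicalized placeholders. The function below is an illustrative sketch, not the repository's actual implementation:

```python
import re

def lexicalize(template: str, entity_pool: dict) -> str:
    """Replace delexicalized placeholders such as [COUNTRY] or [FOOD]
    with concrete values drawn from a culturally aligned entity pool."""
    def substitute(match):
        placeholder = match.group(0)  # e.g. "[FOOD]"
        try:
            return entity_pool[placeholder]
        except KeyError:
            raise KeyError(f"No entity provided for placeholder {placeholder}")
    # Placeholders are upper-case tokens in brackets, e.g. [TV_SHOWS-1]
    return re.sub(r"\[[A-Z_0-9-]+\]", substitute, template)

# A pool aligned with an Indonesian scenario
pool = {"[COUNTRY]": "Indonesia", "[FOOD]": "nasi goreng"}
print(lexicalize("Two friends from [COUNTRY] debate the best [FOOD] stall.", pool))
# Two friends from Indonesia debate the best nasi goreng stall.
```

Raising on an unknown placeholder (rather than leaving it in place) makes missing entries in the entity pool fail loudly before any dialogues are generated.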

🧩 Using Our Codebase

This codebase is intended for reproducing our experiments. It consists of the following components:

  • codes/correlation_analysis: Code for calculating the alignment between human and automatic evaluations.
  • codes/dialogue_generation: Code for generating multi-turn dialogues.
  • codes/g_eval: Code for automatic evaluation using the G-Eval method.
  • codes/reward_model: Code for automatic evaluation using the M-Prometheus and R3 reward model methods.
  • data: Contains the prerequisite data required to create prompts for dialogue generation.
  • output: Contains the generated prompts used for dialogue generation.
  • results: Contains the generated multi-turn dialogues. This directory also includes the results from our human evaluations, automatic evaluations, and alignment analysis.


🏃🏼‍♂️‍➡️ Steps to Generate Dialogues

This guide outlines the steps to generate your own dialogue dataset using this codebase.

Step 1: Prepare the Prerequisite Data

The dialogue generation script dynamically builds conversation scenarios by combining information from several data files located in the /data directory. You will need to prepare the following files, ensuring they follow the specified format. Examples are provided below for clarity.

  • Themes and Topics (data_themes.csv)

    • Purpose: This file establishes the high-level themes and the specific conversational topics related to them.
    • Structure: A CSV with two columns: Themes and Related Topics.
    • Example:
      Themes,Related Topics
      Arts & Entertainment,Favorite Tv Shows From Childhood
      Arts & Entertainment,Favorite Musicians Or Bands
      
  • Subtopic / Scenario Templates (data_topics_subtopics.csv)

    • Purpose: This file provides the narrative structure or "script" for a conversation on a given topic. It uses placeholders (we call these delexicalized entities, like [LANGUAGE]) that will be filled in later.
    • Structure: A CSV with two columns: Topics and Template Sub Topics. Each line under a topic builds the conversational scenario.
    • Example:
      Topics,Template Sub Topics
      Favorite TV Shows from Childhood,Two people discuss the influence of [LANGUAGE] folklore in their favorite childhood TV shows.
      ,"Person A loved a popular [LANGUAGE] [TV_SHOWS-1], while Person B grew up watching [LANGUAGE] [TV_SHOWS-2] on TV."
      
  • Entities (data_entities.csv)

    • Purpose: This file defines the mapping between a placeholder (a delexicalized_ent) and its possible real-world values (its lexicalized_ent). The script uses this to replace placeholders like [LANGUAGE] with actual languages like "Thai" or "Indonesian".
    • Structure: A CSV with two columns: delexicalized_ent and lexicalized_ent.
    • Example:
      delexicalized_ent,lexicalized_ent
      [LANGUAGE],"Thai-tha
      Indonesian-ind
      Javanese-jav"
      
  • Personas (data_personas.csv)

    • Purpose: This file defines character archetypes. It links a theme and topic to a personality template, which also uses placeholders to create varied and specific personas.
    • Structure: A CSV with columns like Themes, Related Topics, and Template Personas.
    • Example:
      Themes,Related Topics,Template Personas
      Arts & Entertainment,Favorite Tv Shows From Childhood,"A person fascinated by traditional [MOVIE_TYPE] and mythological characters: [MYTH_CHARACTER]"
      
  • Personalities (data_personality.csv)

    • Purpose: A simple list of personality traits that can be assigned to the dialogue agents to influence their tone and style.
    • Structure: A CSV with a single Personality column.
    • Example:
      Personality
      Active
      Creative
      Honest
      
  • Names (data_names.csv)

    • Purpose: Provides a pool of culturally specific first and last names, organized by gender and language/region, to assign to the dialogue personas.
    • Structure: A CSV with columns for Gender and language-specific names (e.g., Thailand FirstName, Indonesia FirstName).
    • Example:
      Gender,Thailand FirstName,Indonesia FirstName
      Male,Ananda (อนันดา),Andi
      
  • High-Coupling Entity Map (high_coupling_entities.json)

    • Purpose: This JSON file is crucial for logical consistency. It defines relationships between two entities that are tightly linked (or "highly coupled"). For example, a specific film title should be associated with a correct genre. This prevents the script from generating illogical combinations, like describing the action film "The Raid" as a "romance".
    • Structure: A JSON object where entity1 and entity2 are the linked placeholders. The file then lists the valid entity2 values for each specific entity1 value.
    • Example: The file links [FILM] and [MOVIE_TYPE]. The entry for "The_Raid-ind" specifies that its valid movie types are "martial_arts_epic-gen" and "action-gen".
      {
          "entity1": "[FILM]",
          "entity2": "[MOVIE_TYPE]",
          "The_Raid-ind": [
              "martial_arts_epic-gen",
              "action-gen"
          ],
          "Laskar_Pelangi-ind": [
              "adventure-gen",
              "historical-drama-gen"
          ]
      }

To get started, you can modify the provided example files in the /data directory or create your own following these structures.
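Before running the pipeline, it can be useful to sanity-check that your files follow the structures above. The sketch below uses only the standard library; the helper names are hypothetical, and the assumption that the suffix after the last hyphen (e.g. -tha, -ind, -gen) is a language tag is inferred from the examples above:

```python
import csv
import io

def check_columns(csv_text: str, expected: list[str]) -> list[str]:
    """Return a CSV's header row, raising if expected columns are missing."""
    header = next(csv.reader(io.StringIO(csv_text)))
    missing = [c for c in expected if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return header

def parse_entity_values(cell: str) -> list[tuple[str, str]]:
    """Split a multi-line lexicalized_ent cell into (value, language tag)
    pairs, splitting on the last hyphen so values like
    'historical-drama-gen' are handled correctly."""
    return [tuple(line.rsplit("-", 1)) for line in cell.splitlines()]

themes_csv = "Themes,Related Topics\nArts & Entertainment,Favorite Tv Shows From Childhood\n"
check_columns(themes_csv, ["Themes", "Related Topics"])
print(parse_entity_values("Thai-tha\nIndonesian-ind\nJavanese-jav"))
# [('Thai', 'tha'), ('Indonesian', 'ind'), ('Javanese', 'jav')]
```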

Step 2: Generating the Prompts

This step uses a series of scripts to convert the raw data from Step 1 into fully formed prompts that can be fed to a Large Language Model (LLM). The process involves "lexicalizing" the templates—that is, replacing the placeholders (e.g., [LANGUAGE]) with concrete data (e.g., "Indonesian").

Follow these sub-steps in order:

  1. Generate Lexicalized Scenarios:

    • Purpose: To fill in the scenario templates with specific entities.
    • Action: Run the subtopic_generation.py script. This will take the templates from data_topics_subtopics.csv and insert the corresponding lexicalized entities from data_entities.csv and other data files.
  2. Generate Lexicalized Personas:

    • Purpose: To create specific character personas by filling in the persona templates.
    • Action: Run the persona_generation.py script. Similar to the previous step, this replaces placeholders in data_personas.csv with concrete data.
  3. Select Topics with a Topic Transition:

    • Purpose: To select two distinct topics for each dialogue. This creates a more challenging and realistic conversation where the speakers must transition between different subjects.
    • Action: Run the topic_clustering.py script. This code groups similar topics together. By selecting topics from different clusters, you can ensure the dialogue will feature a clear "topic shift."
  4. Assemble the Final Prompt:

    • Purpose: To combine the generated scenarios, personas, and topic pairs into a final, structured prompt.
    • Action: Use the utility functions within utils.py. This script will take the outputs from the previous steps and assemble them into the final format required by the LLM. You will need to provide the necessary inputs as specified within the script's functions.
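Putting the four sub-steps together, the final prompt assembly might look like the following sketch. The function name and prompt wording are illustrative; the exact format is defined in utils.py:

```python
def assemble_prompt(personas: dict, scenario_a: str, scenario_b: str,
                    language: str) -> str:
    """Combine two lexicalized scenarios (one per topic) and the speaker
    personas into a single generation prompt with an explicit
    topic-shift instruction."""
    persona_lines = "\n".join(
        f"- {name}: {description}" for name, description in personas.items()
    )
    return (
        f"Generate a multi-turn dialogue in {language}.\n"
        f"Speakers:\n{persona_lines}\n"
        f"Begin with this scenario: {scenario_a}\n"
        f"Then transition naturally to: {scenario_b}\n"
    )

prompt = assemble_prompt(
    personas={"Andi": "a person fascinated by traditional wayang films"},
    scenario_a="Two people discuss Indonesian folklore in childhood TV shows.",
    scenario_b="They compare their favorite musicians.",
    language="Indonesian",
)
print(prompt)
```

Because the two scenarios come from different topic clusters (sub-step 3), the prompt naturally forces the topic shift the dataset is designed around.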

Step 3: Generating the Dialogues

With your final prompts ready, you can now generate the dialogues using your preferred LLM. We have included example scripts to call various popular models.

  • For Google Gemini Models:

    • Use the script codes/dialogue_generation/call_gemini.py.
  • For OpenAI GPT Models:

    • Use the script codes/dialogue_generation/call_gpt.py.
  • For Open-Source Models (via Hugging Face):

    • Use the script codes/dialogue_generation/call_vllm.py. This script utilizes the vLLM library for fast and efficient inference with models from the Hugging Face Hub.
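Whichever backend you choose, the generation loop has the same shape. The sketch below accepts any callable that maps a prompt to text, so a Gemini, GPT, or vLLM client can be slotted in; the helper and output format here are illustrative, not the actual interface of the call_*.py scripts:

```python
import json

def generate_dialogues(prompts, generate_fn, out_path):
    """Run every prompt through a backend-specific generate_fn and save
    prompt/dialogue pairs as JSON Lines."""
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            dialogue = generate_fn(prompt)
            f.write(json.dumps({"prompt": prompt, "dialogue": dialogue},
                               ensure_ascii=False) + "\n")

# Stub backend for demonstration; replace with a real model client.
fake_model = lambda p: "A: Hello!\nB: Hi there!"
generate_dialogues(["Generate a dialogue in Thai."], fake_model, "dialogues.jsonl")
```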

🚀 Using Our Dataset

You can access the SEADialogues dataset from the Hugging Face Hub. Below is an example of how to load the data using the datasets library.

```python
from datasets import load_dataset

ds = load_dataset("SEACrowd/SEADialogues")
```

📚 Citation

If you find our work helpful, please cite the SEADialogues paper:

@misc{kautsar2025seadialoguesmultilingualculturallygrounded,
      title={SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages}, 
      author={Muhammad Dehan Al Kautsar and Aswin Candra and Muhammad Alif Al Hakim and Maxalmina Satria Kahfi and Fajri Koto and Alham Fikri Aji and Peerat Limkonchotiwat and Ekapol Chuangsuwanich and Genta Indra Winata},
      year={2025},
      eprint={2508.07069},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.07069}, 
}

If you have any questions, feel free to open a GitHub Issue!
