CCF-canadian-climate-framing

A repository for a computational text-as-data study on climate change coverage in 20 major Canadian newspapers since 1978.

Technical paper


The technical paper provides documentation of:

  • Complete annotation framework with 65 hierarchical categories
  • Machine learning methodology including model selection, training, and validation
  • Performance metrics for all categories (macro F1 = 0.866)
  • Database architecture and PostgreSQL implementation
  • Detailed validation protocols and inter-coder reliability assessments

Introduction

Welcome to the CCF-canadian-climate-framing repository. This project studies media coverage of climate change in the Canadian press through the most comprehensive machine-learning-preprocessed corpus of climate discourse available for research. The CCF Database comprises 266,271 articles from 20 Canadian newspapers (1978–2025), processed into 9.2 million sentence-level analytical units with 65 hierarchical annotations and achieving a macro F1 score of 0.866 across all categories. To the authors' knowledge, this is the first initiative of this scale in Canada.

This work annotates the full text of each article at the sentence level, extracting detailed information to analyze article content over time and across Canadian regions and media outlets. We annotate more than 60 categories, including eight thematic frames (economic, health, security, justice, political, scientific, environmental, cultural), actor networks, climate events, policy responses, emotional tone, and geographic focus. The database structure, implemented in PostgreSQL with indexed boolean columns, supports complex queries combining temporal, linguistic, geographic, and thematic dimensions, as in the sketch below. This repository contains all the scripts, data processing tools, and machine learning models necessary for conducting this study.
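As an illustration of such cross-dimensional queries, here is a minimal sketch using psycopg2. The table name comes from the schema described below, but the column names (detect_health, language, year, province) are hypothetical stand-ins, not the published schema.

```python
# Illustrative cross-dimensional query; column names are hypothetical.
import psycopg2

conn = psycopg2.connect(dbname="CCF")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT year, province, COUNT(*) AS n_sentences
        FROM "CCF_processed_data"
        WHERE detect_health = TRUE            -- thematic dimension
          AND language = 'FR'                 -- linguistic dimension
          AND year BETWEEN 2000 AND 2025      -- temporal dimension
        GROUP BY year, province               -- geographic dimension
        ORDER BY year, province;
        """
    )
    for year, province, n in cur.fetchall():
        print(year, province, n)
conn.close()
```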

The database

This repository includes a newly compiled database of climate change articles from 20 major Canadian newspapers (n = 266,271); the articles are not available in plain text at this time for copyright reasons. The table below shows the distribution of articles per newspaper (after filtering and preprocessing), and the figure below shows the distribution of articles over time.

| Newspaper | Articles |
| --- | --- |
| Toronto Star | 46980 |
| Globe and Mail | 29442 |
| Vancouver Sun | 17871 |
| Edmonton Journal | 18162 |
| Le Devoir | 13685 |
| National Post | 20032 |
| Calgary Herald | 19336 |
| Whitehorse Daily Star | 7603 |
| Montreal Gazette | 9567 |
| Chronicle Herald | 10770 |
| The Telegram | 5841 |
| Times Colonist | 11800 |
| La Presse Plus | 9548 |
| La Presse | 6917 |
| Winnipeg Free Press | 12421 |
| Acadie Nouvelle | 5143 |
| Star Phoenix | 7794 |
| Le Droit | 4727 |
| Toronto Sun | 3174 |
| Journal de Montreal | 5458 |
| Total | 266271 |

Number of Climate Change Articles Per Region in the CCF Corpus (1978-Present)



The project's main idea and objectives

The overarching goal of the project is to establish the first pan-Canadian database—comprehensive across time and space—of media articles on climate change, and to perform an in-depth sentence-level analysis of each article’s content.

The primary purpose is to understand the determinants of climate change media coverage in Canada, in order to inform future research and, ultimately, enhance communication on this topic.

To carry out this overarching research idea, the project is organized around the following objectives, which are currently underway:

| Objective | Description | Status |
| --- | --- | --- |
| Establish a comprehensive pan-Canadian database of climate change media articles | Build a comprehensive and representative database covering the entire country's media landscape, with historical coverage across both time and space. | Completed |
| Deploy an advanced sentence-level annotation pipeline | Combine the precision of manual annotation with the scale of machine learning, plus named-entity extraction, to process and annotate articles at the sentence level. | Completed |
| Implement a rigorous, scientifically robust validation process for machine learning models | Conduct comprehensive performance evaluations using statistical analyses and manual annotations to verify high classification accuracy and ensure research-grade reliability. | In Progress |
| Publish the database and initial analyses | Release the processed database and preliminary research findings for public use. | Upcoming |

Methodology

The research workflow for this project is structured as follows:

  1. Data acquisition and initial corpus: the foundational dataset comprises 266,271 articles related to climate change from 20 major Canadian newspapers, covering the period from 1978 to the present. Due to copyright restrictions, the raw text of these articles is not publicly available in this repository.

  2. Preprocessing: articles are processed with Scripts/Annotation/1_Preprocess.py, which segments texts into two-sentence contexts, the analytical units later used to annotate the articles. This step also involves data cleaning and format standardization.

  3. Database population: The processed textual data, along with article metadata, is organized and stored in a local PostgreSQL database named CCF. The script Scripts/Annotation/5_populate_SQL_database.py manages the creation of the database schema and populates key tables, including CCF_full_data (for raw article information) and CCF_processed_data (for tokenized and annotated sentences).

  4. Annotation strategy & model training:

    • A manual annotation phase is conducted to create high-quality labeled datasets for more than 60 categories (see What do we annotate and extract from texts? below). Scripts such as Scripts/Annotation/2_JSONL.py (to prepare data for annotation tools), Scripts/Annotation/3_Manual_annotations.py (to count and analyze manual annotations), and Scripts/Annotation/4_JSONL_for_training.py (to structure data for machine learning) are employed. While the annotated datasets are not public due to copyright restrictions, they form the basis for training our machine learning models.
    • State-of-the-art transformer-based models (including CamemBERT and other BERT variants, managed via refactored libraries such as AugmentedSocialScientist from Do et al. (2022)) are then trained. The script Scripts/Annotation/6_Training_best_models.py trains and selects the optimal models based on cross-validation performance metrics.
  5. Automated corpus annotation: Once trained and validated, these machine learning models are applied to the entire corpus of 266,271 articles. Scripts/Annotation/7_Annotation.py performs this large-scale annotation for more than 60 categories (see below - What do we annotate and extract from texts?).

  6. Named Entity Recognition (NER): To further enrich the dataset, Named Entity Recognition is performed using Scripts/Annotation/8_NER.py. This script identifies and categorizes mentions of persons (PER), organizations (ORG), and locations (LOC) within the text, using a hybrid approach that combines state-of-the-art models: spaCy for French person entities, and transformer models such as CamemBERT and BERT-base-NER for the remaining entity types and for English.

  7. Validation and Quality Control: The integrity and quality of the annotations are paramount. Scripts/Annotation/9_JSONL_for_recheck.py facilitates the creation of targeted subsets of data for manual re-verification, especially for underrepresented or ambiguous categories. Performance metrics, including precision, recall, and F1 scores for each annotated category, are systematically computed using Scripts/Annotation/10_Annotation_metrics.py to ensure the transparency and quality of the annotation process.

What do we annotate and extract from texts?

We annotate more than 60 pieces of information and categories (frames, actors, emotions, etc.) at the sentence level:

| N | Main category or frame | What the category captures | What it means |
| --- | --- | --- | --- |
| 1 | Geographical Focus | Canadian Context | Situates climate change in Canada (places, actors, data, policies). |
| 2 | Events | Any Climate-Related Event | Mentions at least one of the following five event types. |
| 3 | Events | Natural Disaster | Imminence, arrival, or unfolding of floods, wildfires, hurricanes, heatwaves, etc. |
| 4 | Events | Climate Conference / Summit | International meetings such as COP, UN summits, major national conferences. |
| 5 | Events | Report Release | Publication of governmental, NGO, or scientific reports (e.g., IPCC, Lancet Countdown). |
| 6 | Events | Election Campaign | Climate issues raised during local, provincial, or national elections. |
| 7 | Events | Policy Announcement | Debut or unveiling of new climate laws, regulations, or action plans. |
| 8 | Actors & Messengers | Any Messenger Quoted | Presence of any messenger, expert, or authority figure. |
| 9 | Actors & Messengers | Medical & Public-Health Experts | Physicians, epidemiologists, health ministers, public-health officials. |
| 10 | Actors & Messengers | Economic & Finance Experts | Economists, finance ministers, market analysts, central-bank officials. |
| 11 | Actors & Messengers | Security & Defense Experts | Military officers, defense strategists, security scholars. |
| 12 | Actors & Messengers | Legal Experts | Lawyers, judges, legal scholars, justice ministers. |
| 13 | Actors & Messengers | Cultural Figures | Artists, writers, athletes, arts scholars commenting on climate change. |
| 14 | Actors & Messengers | Scientists (Natural or Social) | Researchers or academics speaking in a scientific capacity. |
| 15 | Actors & Messengers | Environmental Activists | NGO spokespeople or well-known climate activists. |
| 16 | Actors & Messengers | Political Actors | Politicians, government officials, political scientists. |
| 17 | Climate Solutions | Any Solution Mentioned | Mentions any mitigation or adaptation measure. |
| 18 | Climate Solutions | Mitigation Strategies | Measures to reduce GHG emissions or enhance carbon sinks. |
| 19 | Climate Solutions | Adaptation Strategies | Measures to increase social or ecological resilience to climate impacts. |
| 20 | Health & Climate | Any Health Link | Mentions any relationship between climate and health. |
| 21 | Health & Climate | Negative Health Impacts | Heat stress, disease spread, respiratory issues, mental-health burdens, mortality. |
| 22 | Health & Climate | Positive Health Impacts | Benefits such as fewer cold-related deaths. |
| 23 | Health & Climate | Health Co-benefits of Action | Better air quality, improved diets, avoided premature deaths, mental well-being. |
| 24 | Health & Climate | Health-Sector Footprint | Emissions generated by hospitals, pharma supply chains, medical equipment. |
| 25 | Economy & Climate | Any Economic Link | Mentions any economic dimension of climate change. |
| 26 | Economy & Climate | Negative Economic Impacts | Crop losses, tourism decline, productivity drops, rising insurance costs. |
| 27 | Economy & Climate | Positive Economic Impacts | New growing zones, Arctic shipping routes, renewable-energy jobs. |
| 28 | Economy & Climate | Costs of Climate Action | Debt burdens, competitiveness concerns, job displacement, budget trade-offs. |
| 29 | Economy & Climate | Benefits of Climate Action | Economic growth, innovation leadership, job creation, cost savings. |
| 30 | Economy & Climate | Economic Sector Footprint | Emissions from industry, transport, energy; accounting or reduction targets. |
| 31 | Security & Climate | Any Security Link | Mentions any security dimension. |
| 32 | Security & Climate | Military Disaster Response | Army called in for fires, floods, evacuations, or relief. |
| 33 | Security & Climate | Military Base Disruption | Climate impacts on bases or military infrastructure readiness. |
| 34 | Security & Climate | Climate-Driven Displacement | Military management of evacuations or refugee camps. |
| 35 | Security & Climate | Resource Conflict | Tensions or violence over water, land, or minerals worsened by climate change. |
| 36 | Security & Climate | Defense-Sector Footprint | Emissions and energy use of armed forces and defense contractors. |
| 37 | Justice & Climate | Any Justice Link | Mentions any social-justice angle. |
| 38 | Justice & Climate | Winners & Losers | Groups that benefit or suffer from climate measures (workers, vulnerable populations, etc.). |
| 39 | Justice & Climate | North–South Responsibility | Common-but-differentiated responsibilities between high-income and low-income countries. |
| 40 | Justice & Climate | Legitimacy of Responses | Public trust, fairness, acceptability of climate policies. |
| 41 | Justice & Climate | Climate Litigation | Court cases or legal challenges over climate responsibility. |
| 42 | Culture & Climate | Any Culture Link | Mentions any cultural aspect. |
| 43 | Culture & Climate | Artistic Representation | Books, documentaries, plays, exhibitions portraying climate themes. |
| 44 | Culture & Climate | Event Disruption | Sports or cultural events threatened or cancelled due to climate conditions. |
| 45 | Culture & Climate | Loss of Indigenous Practices | Erosion of traditional hunting, fishing, or cultural rituals linked to climate. |
| 46 | Culture & Climate | Cultural-Sector Footprint | Emissions from film production, fashion, large festivals, etc. |
| 47 | Environment & Climate | Any Biodiversity Link | Mentions any biodiversity concern. |
| 48 | Environment & Climate | Habitat Loss | Glacier melt, coral bleaching, forest die-off, wetland drying. |
| 49 | Environment & Climate | Species Loss | Local or global extinction risk for animals or plants. |
| 50 | Science & Climate | Any Science Link | Mentions any scientific aspect. |
| 51 | Science & Climate | Scientific Controversy | Debates on climate change reality, causes, thresholds, geo-engineering ethics. |
| 52 | Science & Climate | Discovery & Innovation | New findings on climate impacts or emerging technologies (e.g., carbon capture). |
| 53 | Science & Climate | Scientific Uncertainty | Expressions of doubt or uncertainty about climate science. |
| 54 | Science & Climate | Scientific Certainty | Strong consensus statements about climate science. |
| 55 | Politics & Policy Process | Any Policy / Political Debate | Mentions any policy measure or political discussion. |
| 56 | Politics & Policy Process | Policy Measures | Concrete climate laws, regulations, or programmes under debate or in force. |
| 57 | Politics & Policy Process | Political Debate & Opinion | Parliamentary disputes, party platforms, public-opinion polls on climate. |
| 58 | Extreme-Weather Mentions | Weather Hazards | Any specific storm, heatwave, flood, wildfire, drought, or ice-melt referenced. |
| 59 | Emotional Tone | Emotion Classification | Detects presence and valence of emotion. |
| 60 | Emotional Tone | Positive Emotion | Hope, optimism, pride, inspiration. |
| 61 | Emotional Tone | Negative Emotion | Fear, anger, sadness, anxiety, loss. |
| 62 | Emotional Tone | Neutral / No Emotion | Factual or analytical coverage with no clear emotional tone. |
| — | Named Entities | Entity Extraction | Detection of people, organisations, and locations mentioned in the text. |
| 63 | Named Entities | Person Mentions | Named individuals (PER). |
| 64 | Named Entities | Organization Mentions | Institutions, corporations, agencies (ORG). |
| 65 | Named Entities | Location Mentions | Geographic places such as cities, provinces, countries (LOC). |

Additional Event sub-categories (aligned with Table A5)

  • Protest: organization or occurrence of a protest or demonstration (e.g., climate strike, anti‑pipeline protest, union rally).
  • Cultural event: organization or hosting of sports, artistic, or cultural events (e.g., Olympics, local marathon, film screening, concert, theatre).
  • Judiciary decision or trial: trials, court rulings, legal proceedings, or regulatory hearings (e.g., ruling on carbon pricing, decision on pipeline approval).

Illustrative results and analyses

Below is an illustrative example of the analyses conducted in this project. The animated GIF shows how the dominant climate-change frame evolves from year to year across Canadian provinces. For each article, the proportion of sentences mentioning a given frame is calculated; the frame with the highest average proportion in each province for each year is designated as the dominant frame. Gray-hatched provinces indicate insufficient data for that year.

Evolution of Dominant Climate Change Frames by Canadian Province (Yearly)
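For transparency, the aggregation behind this figure can be sketched as follows. The DataFrame layout and frame column names are illustrative assumptions, not the project's exact code.

```python
# Hedged sketch of the dominant-frame computation: average each frame's
# sentence share per (province, year), then keep the frame with the
# highest mean. Column names are illustrative assumptions.
import pandas as pd

# One row per article, with per-frame sentence proportions.
articles = pd.DataFrame({
    "province": ["QC", "QC", "ON"],
    "year": [2020, 2020, 2020],
    "economic": [0.10, 0.30, 0.50],
    "health": [0.40, 0.20, 0.10],
})
frames = ["economic", "health"]
dominant = (
    articles.groupby(["province", "year"])[frames].mean()
            .idxmax(axis=1)            # frame with the highest average share
            .rename("dominant_frame")
            .reset_index()
)
print(dominant)
```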


Citation

If you use this repository, the data, or the methodology in your research, please cite:

Lemor, A., Pillod, A. & Taylor, M. (2025). CCF-Canadian-Climate-Framing: A Repository for Analyzing Climate Change Narratives in Canadian Media. [Software/Data Repository]. GitHub. https://github.com/antoinelemor/CCF-canadian-climate-framing


Repository structure

CCF-Canadian-Climate-Framing/
├── Database/
│   ├── Database/
│   │   ├── CCF.media_database.csv _absent from the repository due to copyright restrictions_
│   │   ├── CCF.media_processed_texts.csv _absent from the repository due to copyright restrictions_
│   │   ├── CCF.media_processed_texts_annotated.csv _absent from the repository due to copyright restrictions_
│   │   ├── Canadian_Media_Articles_by_Province.csv
│   │   ├── Canadian_Media_by_Group.csv
│   │   ├── Database_media_count.csv
│   │   └── dominant_frames_yearly.gif
│   └── Training_data/
│       ├── manual_annotations_JSONL/ _excluded until our first publication_
│       │   ├── Annotated_sentences.jsonl _excluded_
│       │   ├── label_config.json _excluded_
│       │   ├── sentences_to_annotate_EN.jsonl _excluded_
│       │   ├── sentences_to_annotate_FR.jsonl _excluded_
│       │   ├── sentences_to_recheck_multiling.jsonl _excluded_
│       │   └── sentences_to_recheck_multiling_done.jsonl _excluded_
│       ├── annotation_bases/ _excluded until our first publication_
│       ├── training_database_metrics.csv
│       ├── models_metrics_summary_advanced.csv
│       ├── non_trained_models.csv
│       ├── manual_annotations_metrics.csv
│       ├── annotated_label_metrics.csv
│       └── final_annotation_metrics.csv
├── Scripts/
│   └── Annotation/
│       ├── 1_Preprocess.py
│       ├── 2_JSONL.py
│       ├── 3_Manual_annotations.py
│       ├── 4_JSONL_for_training.py
│       ├── 5_populate_SQL_database.py
│       ├── 6_Training_best_models.py
│       ├── 7_Annotation.py
│       ├── 8_NER.py
│       ├── 9_JSONL_for_recheck.py
│       ├── 10_Annotation_metrics.py
│       └── 11_Blind_verification.py
├── Models/ _contents are excluded due to file size and ongoing research_
└── requirements.txt


Usage

The project is organized into several scripts, each responsible for different aspects of data processing, annotation, and model training. Below is an overview of how to use them.

Annotation scripts


  1. Preprocess data

    python Scripts/Annotation/1_Preprocess.py
  2. Generate JSONL files

    python Scripts/Annotation/2_JSONL.py
  3. Manual annotations

    python Scripts/Annotation/3_Manual_annotations.py
  4. Prepare JSONL for training

    python Scripts/Annotation/4_JSONL_for_training.py
  5. Populate SQL database

    python Scripts/Annotation/5_populate_SQL_database.py
  6. Training best models

    python Scripts/Annotation/6_Training_best_models.py
  7. Annotation process

    python Scripts/Annotation/7_Annotation.py
  8. NER (Named Entity Recognition)

    python Scripts/Annotation/8_NER.py
  9. Generate JSONL for rechecking

    python Scripts/Annotation/9_JSONL_for_recheck.py
  10. Final annotation metrics

    python Scripts/Annotation/10_Annotation_metrics.py
  11. Blind verification of manual annotations

    python Scripts/Annotation/11_Blind_verification.py

Scripts overview

Annotation scripts

1_Preprocess.py

Purpose: Preprocesses the media database CSV by generating sentence contexts and verifying date formats.

Key features: Splits texts into two-sentence contexts. Counts words and updates relevant columns. Saves processed data to a new CSV.

Dependencies: os, pandas, spacy
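A minimal sketch of the two-sentence windowing follows. It is illustrative only: the actual script also cleans data, counts words, checks date formats, and handles French articles with a French spaCy model.

```python
# Minimal sketch of two-sentence context windowing (illustrative only).
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English model

def two_sentence_contexts(text: str) -> list[str]:
    """Split an article into consecutive two-sentence analytical units."""
    sents = [s.text.strip() for s in nlp(text).sents]
    return [" ".join(sents[i:i + 2]) for i in range(0, len(sents), 2)]

print(two_sentence_contexts("Wildfires raged. Ottawa responded. Costs rose."))
# ['Wildfires raged. Ottawa responded.', 'Costs rose.']
```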

2_JSONL.py

Purpose: Converts processed text data into JSONL files for manual annotation, separating French and English sentences.

Key features: Loads and cleans CSV data. Removes duplicates. Splits data by language. Creates JSONL with metadata fields.

Dependencies: os, pandas, json
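A minimal sketch of the JSONL export, one JSON object per line with text plus metadata; the field names are illustrative assumptions:

```python
# Minimal sketch of JSONL export for annotation tools; field names assumed.
import json

rows = [
    {"text": "Les feux de forêt s'intensifient au Québec.",
     "meta": {"doc_id": "a1", "language": "FR"}},
]
with open("sentences_to_annotate_FR.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```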

3_Manual_annotations.py

Purpose: Reads manual annotations from a JSONL file, counts label usage, and exports annotation metrics.

Key features: Calculates label usage distribution. Outputs CSV with label proportions.

Dependencies: json, csv, os

4_JSONL_for_training.py

Purpose: Prepares manually annotated JSONL data for training/validation splits.

Key features: Splits data into train/validation sets. Handles stratification for main/sub labels. Exports annotation metrics to a CSV.

Dependencies: json, os, random, csv
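A minimal sketch of label-stratified splitting under assumed data shapes; the script's exact handling of main and sub labels differs:

```python
# Minimal sketch of a stratified train/validation split: shuffle within
# each label group so both splits preserve label proportions.
import random
from collections import defaultdict

def stratified_split(rows, val_ratio=0.2, seed=42):
    """rows: list of (text, label); returns (train, validation)."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    rng, train, val = random.Random(seed), [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = int(len(group) * val_ratio)
        val.extend(group[:cut])
        train.extend(group[cut:])
    return train, val
```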

5_populate_SQL_database.py

Purpose: Creates the local PostgreSQL database CCF and populates it with two tables built from the project's CSV files (CCF_full_data and CCF_processed_data), containing all the extracted articles.

Due to copyright restrictions, the code used to extract the articles is not published.
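A hedged sketch of the two-table layout follows; column names are illustrative only, and the real schema adds one indexed boolean column per annotation category plus full article metadata:

```python
# Hedged sketch of the CCF two-table layout; illustrative column names.
import psycopg2

conn = psycopg2.connect(dbname="CCF")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS "CCF_full_data" (
            doc_id TEXT PRIMARY KEY,
            title  TEXT,
            media  TEXT,
            date   DATE
        );
        CREATE TABLE IF NOT EXISTS "CCF_processed_data" (
            sentence_id TEXT PRIMARY KEY,
            doc_id      TEXT REFERENCES "CCF_full_data"(doc_id),
            sentence    TEXT,
            language    TEXT
        );
    """)
conn.close()
```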

6_Training_best_models.py

Purpose: Trains the best-performing models, selected using advanced cross-validation metrics.

Key features: Loads best epoch from models_metrics_summary_advanced.csv. Summarizes fully trained/partial/not trained status. Logs results and error handling.

Dependencies: os, sys, glob, shutil, json, pandas, torch, AugmentedSocialScientist
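A hedged sketch of metric-driven model selection; the column names (category, epoch, macro_f1) are assumptions about the CSV layout, not its confirmed schema:

```python
# Hedged sketch: pick the best epoch per category from a CV summary CSV.
import pandas as pd

metrics = pd.read_csv("Database/Training_data/models_metrics_summary_advanced.csv")
best = (
    metrics.sort_values("macro_f1", ascending=False)
           .groupby("category", as_index=False)
           .first()  # highest macro-F1 row per annotation category
)
print(best[["category", "epoch", "macro_f1"]])
```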

7_Annotation.py

Purpose: Applies trained English and French models to annotate the main database, saving or resuming progress as needed.

Key features: Loads/updates existing annotation columns. Performs annotation for detection, sub-categories, etc. Logs and saves partial results to handle interruptions.

Dependencies: torch, tqdm, pandas, numpy

8_NER.py

Purpose: This script performs large-scale Named Entity Recognition (PER, ORG, LOC) on the sentence-level data stored in the PostgreSQL table CCF_processed_data.

Key features: Language-aware NER pipelines in French (spaCy for PER + CamemBERT for ORG/LOC) and English (BERT-base-NER for PER/ORG/LOC).

Dependencies: psycopg2, pandas, torch, tqdm, joblib, spacy, transformers
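A hedged sketch of the language-aware routing described above; the checkpoints shown (fr_core_news_lg, dslim/bert-base-NER) are plausible stand-ins rather than confirmed choices, and the French CamemBERT pipeline for ORG/LOC is omitted for brevity:

```python
# Hedged sketch of hybrid, language-aware NER; model names are assumptions.
import spacy
from transformers import pipeline

nlp_fr = spacy.load("fr_core_news_lg")  # French PER via spaCy
ner_en = pipeline("ner", model="dslim/bert-base-NER",
                  aggregation_strategy="simple")  # English PER/ORG/LOC

def extract_entities(sentence: str, lang: str) -> list[tuple[str, str]]:
    if lang == "FR":
        # spaCy covers French persons; a CamemBERT NER pipeline (omitted
        # here) would handle French ORG/LOC in the full script.
        return [(ent.text, ent.label_)
                for ent in nlp_fr(sentence).ents if ent.label_ == "PER"]
    return [(e["word"], e["entity_group"]) for e in ner_en(sentence)]

print(extract_entities("Justin Trudeau spoke in Ottawa.", "EN"))
```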

9_JSONL_for_recheck.py

Purpose: Builds a multilingual JSONL file to re-check model annotations directly from the PostgreSQL table CCF_processed_data, ensuring statistically robust sub-class evaluation.

Key features: Uses root-inverse weighted sampling with hard constraints to ensure balanced representation across rare and common labels while maintaining language distribution and excluding previously annotated sentences.

Dependencies: pandas, psycopg2, tqdm, json, math, random
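A minimal sketch of the root-inverse weighting (weight proportional to 1/√(label frequency)), which flattens the label distribution without fully inverting it. It assumes one label per sentence for simplicity; the real script adds hard constraints, language balancing, and exclusion of previously annotated sentences:

```python
# Minimal sketch of root-inverse weighted sampling (one label per row).
import math
import random

def root_inverse_sample(rows, k, seed=42):
    """rows: list of (sentence_id, label); returns k ids (with replacement)."""
    counts = {}
    for _, label in rows:
        counts[label] = counts.get(label, 0) + 1
    weights = [1 / math.sqrt(counts[label]) for _, label in rows]
    rng = random.Random(seed)
    ids = [sid for sid, _ in rows]
    return rng.choices(ids, weights=weights, k=k)
```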

10_Annotation_metrics.py

Purpose: Benchmarks the model-generated sentence annotations (stored in CCF_processed_data) against a gold-standard JSONL and outputs a CSV with precision, recall, and F1 for each label, both classes (1 = positive, 0 = negative), each language (EN, FR) and the combined corpus (ALL), plus micro, macro, and weighted averages.

Key features: PostgreSQL pull with automatic dtype coercion, language-aware confusion matrices, per-class metrics, aggregated “ALL” row, four-decimal wide-format CSV export, tqdm progress bar, and clear console logging.

Dependencies: csv, json, os, pathlib, collections, typing, pandas, psycopg2, tqdm.
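A minimal sketch of the per-class metric computation (precision, recall, and F1 for one label's positive class), rounded to four decimals as in the exported CSV:

```python
# Minimal sketch of per-class precision/recall/F1 against gold labels.
def prf(gold, pred, positive=1):
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return round(precision, 4), round(recall, 4), round(f1, 4)

print(prf([1, 0, 1, 1], [1, 0, 0, 1]))  # (1.0, 0.6667, 0.8)
```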

11_Blind_verification.py

Purpose: Creates a blind-verification copy of any manual-annotation JSONL by wiping all labels, so annotators can re-label sentences without bias.

Key features: Efficiently processes large JSONL files with streaming I/O, automatic output directory creation, CLI arguments with sensible defaults, optional progress tracking, and robust error handling.

Dependencies: argparse, json, pathlib, sys, tqdm.
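A minimal sketch of the label-wiping step with streaming I/O; the field name "label" is an assumption about the JSONL layout:

```python
# Minimal sketch: stream a JSONL file and emit a copy with labels wiped.
import json
from pathlib import Path

def wipe_labels(src: Path, dst: Path) -> None:
    dst.parent.mkdir(parents=True, exist_ok=True)
    with src.open(encoding="utf-8") as fin, \
         dst.open("w", encoding="utf-8") as fout:
        for line in fin:  # streaming: one record in memory at a time
            record = json.loads(line)
            record["label"] = []  # wipe annotations, keep text and metadata
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
```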
