
CausalEarthNet

Learning the set-wise causal structure of the Earth system using hypergraphs, outperforming pairwise models in identifying climate tipping points and enabling transportable counterfactual forecasts

Code Structure

CausalEarthNet
├── data
│   ├── download_era5.py
│   ├── preprocess.py
│   ├── aggregate.py
│   ├── feature_set.csv
│   └── regional_timeseries_final.csv
├── src
│   ├── config.py
│   ├── hypergraph_discovery.py
│   └── baseline.py
├── main.py
└── requirements.txt

Setup and Dependencies

Step 1: Prerequisites and Required Libraries

Before running the data acquisition step, set up the necessary environment. All requirements are listed in the requirements.txt file:

pip install -r requirements.txt

or copy and paste the list below:

# Core Data Handling and Scientific Computing
pandas
numpy
xarray
matplotlib

# Causal Inference and Modeling
tigramite
scikit-learn

# Data Acquisition and I/O Backends (Crucial for ERA5 formats)
cdsapi
cfgrib
netcdf4
h5netcdf

Step 2: Data Acquisition

To access data from the Climate Data Store (CDS), follow the steps below:

  1. Set up the CDS API key
    Register for an account at https://cds.climate.copernicus.eu and store your API credentials in the ~/.cdsapirc file, as described in the CDS API documentation.

  2. Install the CDS API client
    The CDS API client is a Python-based library that supports Python 3.

    pip install cdsapi
    
  3. Use the CDS API client for data access
    Once the CDS API client is installed, it can be used to request data from the datasets listed in the CDS, ADS, ECDS, XDS and CEMS Early Warning DS catalogues. The bottom of each dataset download form has a "Show API request code" button that displays a Python code snippet for running the manually built request. For example:

    import cdsapi
    client = cdsapi.Client()
    dataset = "<DATASET-SHORT-NAME>"
    request = {
        <SELECTION-REQUEST>
    }
    target = "<TARGET-FILE>"
    client.retrieve(dataset, request, target)
    

Analysis Pipeline

Step 3: Data Download and Preprocessing

This stage generates the essential input file, regional_timeseries_final.csv.

3.1 Download Raw ERA5 Data

The script below downloads 20 years (2000-2019) of monthly mean ERA5 data for the variables and pressure levels needed for our regional proxies.

python data/download_era5.py 
# Downloads 'era5_surface_monthly.nc' and 'era5_pressure_monthly.nc' 
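
For orientation, here is a minimal sketch of what the pressure-level request inside download_era5.py might look like; the dataset short name and request keys follow the CDS download form, but the script's exact selection may differ:

import cdsapi

client = cdsapi.Client()
client.retrieve(
    "reanalysis-era5-pressure-levels-monthly-means",
    {
        "product_type": "monthly_averaged_reanalysis",
        "variable": ["temperature", "geopotential", "specific_humidity"],
        "pressure_level": ["500", "850", "1000"],
        "year": [str(y) for y in range(2000, 2020)],
        "month": [f"{m:02d}" for m in range(1, 13)],
        "time": "00:00",
        "format": "netcdf",
    },
    "era5_pressure_monthly.nc",  # target file used in Step 3.2
)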

3.2 Aggregate Regional Time Series

This script loads the raw, multi-dimensional NetCDF files, for example:

ds1 = xr.open_dataset('era5_pressure_monthly.nc', engine='netcdf4')

merges them with xr.merge, and processes them into clean, univariate regional time series by performing vertical level selection (e.g., z500) and spatial averaging over defined regions (e.g., ENSO, Arctic, E_Africa).

python data/aggregate.py
# Generates 'regional_timeseries_final.csv'

3.2.1 Defining Regional Proxies and Levels

Region Name | Latitude (°) | Longitude (°) | Relevant Variable        | Pressure Level | Final Column Name
ENSO        | -5 to 5      | 190 to 240    | Temperature (t)          | 1000 hPa       | t1000_ENSO
Midlat      | 40 to 60     | -100 to 100   | Temperature (t)          | 1000 hPa       | t1000_Midlat
Arctic      | 70 to 90     | 0 to 360      | Geopotential (z)         | 500 hPa        | z500_Arctic
E_Africa    | -10 to 10    | 30 to 50      | Specific Humidity (q)    | 850 hPa        | q850_E_Africa
E_Africa    | -10 to 10    | 30 to 50      | Total Precipitation (tp) | Surface        | tp_E_Africa

References:
https://www.ncei.noaa.gov/access/monitoring/enso/sst
https://www.ncei.noaa.gov/access/monitoring/ao/
https://www.cpc.ncep.noaa.gov/products/CDB/Extratropics/fige7.shtml
https://charlie.weathertogether.net/2012/09/el-ninosouthern-oscillation-enso-for/
https://pure.iiasa.ac.at/id/eprint/15033/1/Moon%20J.%20et%20al_SJFS_NO6.pdf
https://www.aoml.noaa.gov/phod/docs/lopez_kirtman_climate_dynamics_2018.pdf

3.2.2 Aggregation Step

Summarize multi-dimensional climate datasets into interpretable (one-dimensional) signals for analysis, such as climate indices or teleconnection proxies.

A. Vertical Selection:

Example:
data = data.sel(pressure_level=500, method='nearest')

This isolates the signal at the layer most relevant to the physical process. This step is not needed for surface-only fields (e.g., precipitation).

B. Spatial Aggregation:

The code calculates the mean value across all latitude and longitude points within the defined box (.mean(dim=["latitude", "longitude"])).

All resulting univariate time series are concatenated into a single pandas DataFrame, indexed by time, and saved as regional_timeseries_final.csv, which is then ready for the next step.
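
A condensed sketch of this step, assuming the regions and levels from the table in 3.2.1 (ERA5 stores latitude in descending order, hence the reversed slice):

import xarray as xr
import pandas as pd

# Merge the surface and pressure-level files into one dataset
ds = xr.merge([
    xr.open_dataset("era5_pressure_monthly.nc", engine="netcdf4"),
    xr.open_dataset("era5_surface_monthly.nc", engine="netcdf4"),
])

def box_mean(da, lat_max, lat_min, lon_min, lon_max, level=None):
    """Vertical selection (if a level is given) followed by box averaging."""
    if level is not None:
        da = da.sel(pressure_level=level, method="nearest")
    da = da.sel(latitude=slice(lat_max, lat_min), longitude=slice(lon_min, lon_max))
    return da.mean(dim=["latitude", "longitude"])

series = {
    "t1000_ENSO": box_mean(ds["t"], 5, -5, 190, 240, level=1000),
    "z500_Arctic": box_mean(ds["z"], 90, 70, 0, 360, level=500),
    "tp_E_Africa": box_mean(ds["tp"], 10, -10, 30, 50),  # surface field, no level
}
pd.DataFrame({k: v.to_series() for k, v in series.items()}).to_csv(
    "regional_timeseries_final.csv")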

3.3 Feature Engineering

Transforms the aggregated climate time series (regional_timeseries_final.csv) into the standardized formats required for the causal inference and predictive modeling steps.

3.3.1 Anomaly Computation:

load_clean_data(): outputs a clean, stationary time series of anomalies by removing the strong seasonal cycle. This ensures the data reflects unpredictable anomalies and isolates non-seasonal physical teleconnections.
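
A minimal sketch of the deseasonalization, assuming a monthly DatetimeIndex (the actual load_clean_data() may differ in detail):

import pandas as pd

df = pd.read_csv("regional_timeseries_final.csv", index_col=0, parse_dates=True)

# Subtract each calendar month's climatological mean to remove the seasonal cycle
climatology = df.groupby(df.index.month).transform("mean")
df_anomaly = df - climatology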

3.3.2 Handling Missing Values:

The original CSV file contains many missing values. The pipeline applies a forward fill (ffill) followed by a backward fill (bfill), then drops any remaining NaN rows. This ensures the resulting time series is complete and continuous.
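
In pandas terms, this cleaning step amounts to:

# Forward-fill, backward-fill, then drop any rows that remain incomplete
df_anomaly_clean = df_anomaly.ffill().bfill().dropna()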

3.3.3 Conversion to a tigramite DataFrame:

df_anomaly_clean is converted into a tigramite DataFrame to prepare it specifically for the baseline analysis.
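
A sketch of the conversion; the module path matches recent tigramite releases and may differ in older versions:

from tigramite import data_processing as pp

# Wrap the cleaned anomaly matrix together with its variable names
dataframe = pp.DataFrame(df_anomaly_clean.values,
                         var_names=list(df_anomaly_clean.columns))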

3.3.4 Handling Time Lags: Additional Step

create_lagged_dataframe() transforms the time series into a feature matrix, explicitly creating a separate column for every lagged time step up to max_lag.

For example, if max_lag = 4, the column t1000_Midlat yields t1000_Midlat_t-1, t1000_Midlat_t-2, t1000_Midlat_t-3, and t1000_Midlat_t-4.

The resulting DataFrame provides the input features (X matrix) for the target variable at time t (Y_t) and is used to train the Ridge regression models. Output: feature_set.csv.
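
A minimal sketch of such a helper using pandas shift, assuming the column-naming scheme above:

import pandas as pd

def create_lagged_dataframe(df, max_lag=4):
    """Create one column per variable and lag, e.g. t1000_Midlat_t-1 ... _t-4."""
    lagged = {f"{col}_t-{k}": df[col].shift(k)
              for col in df.columns for k in range(1, max_lag + 1)}
    # Keep the unlagged columns as prediction targets (Y_t); drop the first
    # max_lag rows, which lack a complete lag history
    return pd.concat([df, pd.DataFrame(lagged)], axis=1).dropna()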

Step 4: Causal Analysis and Comparison

This step uses the aggregated time series to perform two analyses: (1) the PCMCI+ baseline and (2) the hypergraph method. The goal is to compare their predictive power (R²).

4.1 PCMCI+ Baseline Comparison

src/baseline.py contains the baseline model.

PCMCI+ (the PC algorithm combined with the Momentary Conditional Independence (MCI) test, extended to handle contemporaneous links)

Independence test: CMIknn (Conditional Mutual Information estimated with k-nearest neighbours), chosen to capture non-linear relationships.

Output: a list of the most statistically significant (p-value < alpha) pairwise links (e.g., t1000_ENSO at (t-3) -> tp_E_Africa), used as input for the baseline model in the final comparison.
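
A sketch of how the baseline run might look with tigramite; the CMIknn import path moved between tigramite versions, so treat it as indicative:

from tigramite.pcmci import PCMCI
from tigramite.independence_tests.cmiknn import CMIknn  # older releases: tigramite.independence_tests

# 'dataframe' is the tigramite DataFrame built in Step 3.3.3
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=CMIknn(significance="shuffle_test"))
results = pcmci.run_pcmciplus(tau_max=4, pc_alpha=0.05)

# Links with p-value below alpha are kept as significant pairwise links
significant = results["p_matrix"] < 0.05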

4.2 Hypergraph Discovery

src/hypergraph_discovery.py implements the hypergraph causal discovery method.

4.2.1 Conditional Mutual Information:

conditional_mutual_information(X, Y, Z) estimates the CMI I(X_S; Y | Z), where X_S is a set of lagged drivers, Y is the target variable at the present time t, and Z is the set of all other system variables.

Significance: higher CMI values indicate a stronger dependency between the set X_S and the target Y.
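
As an illustrative stand-in for the repository's estimator (not the implementation in src/hypergraph_discovery.py), CMI has a closed form under a Gaussian approximation:

import numpy as np

def gaussian_cmi(X, Y, Z=None):
    """Gaussian plug-in estimate of I(X; Y | Z) in nats. X, Y, Z are
    2-D arrays (samples x dims); Z may be None for unconditional MI."""
    def logdet_cov(*blocks):
        cols = [b for b in blocks if b is not None and b.size]
        if not cols:
            return 0.0  # determinant of an empty covariance is 1
        S = np.atleast_2d(np.cov(np.hstack(cols), rowvar=False))
        return np.linalg.slogdet(S)[1]
    return 0.5 * (logdet_cov(X, Z) + logdet_cov(Y, Z)
                  - logdet_cov(Z) - logdet_cov(X, Y, Z))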

4.2.2 Independence Test:

test_independence(X, Y, Z) determines whether the observed CMI value is statistically significant.

If the p-value is below alpha (the significance level), the hyperedge is considered significant.
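
A sketch of a permutation-style test built on the estimator above; the repository's test may instead permute within neighbourhoods of Z to keep the conditioning structure intact:

import numpy as np

def test_independence(X, Y, Z, n_perm=200, seed=0):
    """Return (cmi, p_value), comparing the observed CMI against a null
    distribution obtained by shuffling the rows of X."""
    rng = np.random.default_rng(seed)
    observed = gaussian_cmi(X, Y, Z)
    null = [gaussian_cmi(rng.permutation(X), Y, Z) for _ in range(n_perm)]
    p_value = (1 + sum(v >= observed for v in null)) / (1 + n_perm)
    return observed, p_value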

4.2.3 Search for Hyperedges:

The discover_hypergraph() function searches over candidate driver sets and records every set X_S that is significant as a discovered hyperedge.
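
A brute-force sketch of the search over small driver sets, reusing the two helpers above (the actual discover_hypergraph() may prune candidates more aggressively):

from itertools import combinations

def discover_hypergraph(lagged, target_col, max_set_size=3, alpha=0.05):
    """Test every driver set up to max_set_size; keep the significant ones."""
    drivers = [c for c in lagged.columns if c != target_col]
    Y = lagged[[target_col]].values
    hyperedges = []
    for k in range(1, max_set_size + 1):
        for S in combinations(drivers, k):
            Z = lagged[[c for c in drivers if c not in S]].values
            cmi, p = test_independence(lagged[list(S)].values, Y, Z)
            if p < alpha:
                hyperedges.append((S, cmi, p))
    return sorted(hyperedges, key=lambda e: -e[1])  # strongest CMI first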

4.3 Performance Comparison

The pairwise links obtained from the baseline model (PCMCI+) are used to identify the single best pairwise driver of the target variable, based on the lowest p-value and highest CMI. A Ridge regression model is then trained on this driver, and its R² value (r2_pair) is recorded as the performance baseline.

Likewise, for the best discovered hyperedge, a regression model is trained using all the features within that hyperedge, and its R² value (r2_hyper) is recorded.

The claim is supported by reporting the relative R² improvement. The result quantitatively demonstrates that set-wise interactions (hyperedges) significantly improve predictive power over the strongest single-link baseline.
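
A sketch of the comparison; best_pairwise_driver, best_hyperedge, lagged, and y are assumed to come from the previous steps and are illustrative names:

from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

def ridge_r2(X, y, train_frac=0.8):
    """Fit Ridge on a chronological split and score R² on the held-out tail."""
    n = int(train_frac * len(y))
    model = Ridge(alpha=1.0).fit(X[:n], y[:n])
    return r2_score(y[n:], model.predict(X[n:]))

r2_pair = ridge_r2(lagged[[best_pairwise_driver]].values, y)
r2_hyper = ridge_r2(lagged[list(best_hyperedge)].values, y)
print(f"Relative R² improvement: {(r2_hyper - r2_pair) / abs(r2_pair):.1%}")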
