@Aaryan-549

Description

This PR implements an MVP for checkpointing and error-file tracking by adding a status attribute to output NetCDF files. Users and automated systems can then determine at a glance whether a simulation completed successfully, is an intermediate checkpoint, or stopped on an error condition.

Addresses: #1679

Problem

Currently, TORAX simulations that encounter errors (NaN detection, negative profiles, quasineutrality violations, or reaching minimum timestep) may not write output files consistently. This makes it difficult to:

  • Debug long-running simulations that fail.
  • Distinguish programmatically between completed and failed runs.
  • Implement checkpointing for verification, surrogate training, or long-pulse scenarios (e.g., DEMO, STEP, ARC).

Solution

This implementation follows the design feedback from the issue, prioritizing simplicity and maximal reuse of existing infrastructure.

  1. New SimStatus Enum (_src/state.py):

    • COMPLETED: Simulation reached t_final successfully.
    • CHECKPOINT: Intermediate checkpoint (reserved for V2/future use).
    • ERROR: Simulation stopped due to an error condition.
  2. Status Attribute in Output (_src/output_tools/output.py):

    • All NetCDF output files now include a global status attribute.
    • This is automatically set based on the sim_error state.
    • Accessible via data_tree.attrs['status'].
  3. Consistent Error File Writing (_src/simulation_app.py):

    • Output files are now explicitly written for ALL error conditions:
      • NAN_DETECTED
      • NEGATIVE_CORE_PROFILES
      • QUASINEUTRALITY_BROKEN
      • REACHED_MIN_DT
    • Files contain all valid timesteps captured up to the error.
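The enum and the mapping from the final error state can be sketched as follows. This is a minimal illustration based on the description above, not the actual TORAX code: `status_from_sim_error` and the "`sim_error is None` means success" convention are assumptions. The real enum is a `StrEnum` (Python 3.11+); the `(str, Enum)` mixin below behaves equivalently for serialization and keeps the sketch runnable on older Pythons.

```python
import enum


class SimStatus(str, enum.Enum):
    """Sketch of the SimStatus enum added to _src/state.py."""
    COMPLETED = "completed"    # simulation reached t_final successfully
    CHECKPOINT = "checkpoint"  # intermediate checkpoint (reserved for V2)
    ERROR = "error"            # simulation stopped on an error condition


def status_from_sim_error(sim_error) -> SimStatus:
    """Map the final sim_error state to a status value (illustrative)."""
    # Assumption: sim_error is None when the run completed cleanly, and a
    # truthy error marker (e.g. "NAN_DETECTED") otherwise.
    return SimStatus.COMPLETED if sim_error is None else SimStatus.ERROR
```

In this sketch, the output writer would set `data_tree.attrs['status'] = status_from_sim_error(sim_error).value` just before serializing the DataTree to NetCDF.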

Design Principles Followed

  • Simplicity: Minimal changes with no complex logic introduced.
  • Reuse: Uses the standard NetCDF file format and existing output_tools.
  • No Breaking Changes: Purely additive; does not modify existing restart behavior or API.
  • Single Format: Checkpoints/Error files use the exact same schema as regular output files.

Changes

  • torax/_src/state.py: Added SimStatus StrEnum.
  • torax/_src/output_tools/output.py: Added logic to inject the status attribute into the DataTree.
  • torax/_src/simulation_app.py: Improved error logging and ensured file writing triggers on error states.
  • torax/_src/output_tools/tests/output_test.py: Added comprehensive tests for the status attribute.
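Downstream, the attribute makes programmatic triage of a batch of runs straightforward. A sketch, assuming each file's attrs mapping is what `data_tree.attrs` returns after opening the NetCDF output (the helper and the file names are hypothetical):

```python
from collections import defaultdict


def triage_by_status(runs: dict) -> dict:
    """Group run names by their 'status' attribute.

    runs: mapping of run name -> attrs dict, as read from
    data_tree.attrs after opening each NetCDF output file.
    """
    groups = defaultdict(list)
    for name, attrs in runs.items():
        # Older files written before this PR have no status attribute.
        groups[attrs.get("status", "unknown")].append(name)
    return dict(groups)
```

For example, `triage_by_status({"run_a.nc": {"status": "completed"}, "run_b.nc": {"status": "error"}})` groups `run_a.nc` under `completed` and `run_b.nc` under `error`, with pre-PR files falling into an `unknown` bucket rather than raising.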

Testing

  • Compilation: All files compile successfully.
  • New Tests: Verified that the status attribute is set to COMPLETED for successful runs and to ERROR for failed runs.
  • Regression: Existing tests pass; fully backward compatible.
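The new checks can be sketched at the level of the attrs round-trip. This is a simplified stand-in; the real tests in output_test.py go through the actual output pipeline and NetCDF serialization:

```python
def simulate_write(sim_error):
    """Stand-in for the output writer: returns the attrs it would attach."""
    status = "completed" if sim_error is None else "error"
    return {"status": status}


# Successful run -> status 'completed'.
assert simulate_write(None)["status"] == "completed"
# Failed run (e.g. NaN detected) -> status 'error'.
assert simulate_write("NAN_DETECTED")["status"] == "error"
```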

Future Work (V2)

This MVP establishes the foundation for advanced checkpointing features planned for the future, including:

  • Periodic checkpoint writing during simulation execution.
  • Unification of Checkpoint and Restart APIs.
  • Configurable checkpoint intervals and automatic cleanup.

Aaryan-549 and others added 7 commits November 6, 2025 03:28
This commit implements a Custom Pedestal Model API that allows users to
define custom pedestal scaling laws without modifying TORAX source code.

Fixes google-deepmind#1711

## Changes

- Add CustomPedestalModel class supporting user-defined callable functions
- Add CustomPedestal Pydantic configuration
- Update PedestalConfig union to include CustomPedestal
- Add comprehensive unit tests (7 test cases)
- Add example configuration with EPED-like scaling
- Add complete API documentation

## Features

Users can now provide Python functions to compute:
- Ion temperature at pedestal (T_i_ped)
- Electron temperature at pedestal (T_e_ped)
- Electron density at pedestal (n_e_ped)
- Optional dynamic pedestal location (rho_norm_ped_top)

Functions receive full access to runtime parameters, geometry, and
core profiles, enabling machine-specific scaling laws (e.g., STEP
pedestal models with Europed data fits).
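A user-supplied scaling law might look like the sketch below. Everything here is illustrative: the `Geometry` stand-in, the exponents, and the function signature are assumptions for this example, not the real TORAX API (a real callable would also receive runtime parameters and core profiles, as described above):

```python
import dataclasses


@dataclasses.dataclass
class Geometry:
    """Minimal stand-in for the TORAX geometry object (assumption)."""
    B_0: float    # toroidal field on axis [T]
    Ip_MA: float  # plasma current [MA]


def eped_like_T_e_ped(geometry: Geometry) -> float:
    """Illustrative EPED-like pedestal electron temperature [keV].

    The exponents are placeholders, not a fitted scaling law; they only
    encode the qualitative trend that T_e_ped rises with current and field.
    """
    return 0.5 * geometry.Ip_MA ** 0.2 * geometry.B_0 ** 0.8
```

A machine-specific model (e.g. a STEP fit to Europed data) would replace the body with the fitted expression while keeping the same callable shape.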

## API Design

Follows the transport model pattern with:
- JAX Model Layer: CustomPedestalModel (frozen dataclass)
- Pydantic Config Layer: CustomPedestal (validation)
- Runtime Parameters: time-varying support

Fully backwards compatible - no changes to existing models.

- Add SimStatus StrEnum (completed, checkpoint, error) to state.py
- Add status attribute to nc output files in output.py
- Ensure nc files are always written, even on errors
- Add tests for status attribute with both completed and error states

Addresses google-deepmind#1679 (MVP implementation)

This implements a simple MVP for checkpointing/error file tracking:
- Output nc files now include a 'status' attribute indicating whether
  the simulation completed, is a checkpoint, or encountered an error
- The simulation writes output files for all error conditions
  (NaN, negative profiles, quasineutrality broken, reached min timestep)
- Uses existing infrastructure - just the standard nc file format
- No breaking changes to API
