Skip to content

Conversation

@dhruvDev23
Copy link

@dhruvDev23 dhruvDev23 commented Dec 10, 2025

Fixes #384

(@tovrstra added the first line to this message and remove this info from the title to be in line with common practices on GitHub.)

This PR adds the read and write support for the Trexio file format using the HDF5 backend.

Implementation

New Feature

  • Implemented load_one and dump_one functions in iodata/formats/trexio.py
  • I have used the HDF5 backend to make sure the data is stored reliably.

Testing

  • I added a robust test case that saves data to a file and reads it back to make sure nothing is lost.

Summary by Sourcery

Add support for reading and writing TREXIO files using the HDF5 backend and verify round-trip consistency.

New Features:

  • Introduce TREXIO format handler with load and dump functions for nuclei and electron data.
  • Support automatic inference of molecular charge and spin polarization from TREXIO electron information.

Build:

  • Add trexio as a development dependency in the project configuration.

Tests:

  • Add subprocess-based round-trip test ensuring TREXIO dump and load preserve atomic and electronic data.

@sourcery-ai
Copy link
Contributor

sourcery-ai bot commented Dec 10, 2025

Reviewer's Guide

Adds TREXIO (.trexio) file format support using the HDF5 backend, including load/dump helpers and a subprocess-based roundtrip test, and wires trexio into the development dependencies.

Sequence diagram for dumping IOData to TREXIO HDF5 file

sequenceDiagram
    participant Caller
    participant dump_one
    participant IOData
    participant OS
    participant trexio
    participant HDF5File

    Caller->>dump_one: call dump_one(f, data)
    dump_one->>trexio: _import_trexio()
    alt trexio_missing
        dump_one-->>Caller: raise RuntimeError
    else trexio_available
        dump_one->>IOData: read atcoords, atnums, nelec, spinpol
        dump_one->>dump_one: validate atcoords and atnums
        alt invalid_atoms
            dump_one-->>Caller: raise RuntimeError
        else valid_atoms
            dump_one->>dump_one: get filename from f.name
            alt missing_name
                dump_one-->>Caller: raise RuntimeError
            else has_name
                dump_one->>Caller: f.close()
                dump_one->>OS: check and remove existing file
                dump_one->>trexio: File(filename, w, TREXIO_HDF5)
                trexio->>HDF5File: create and open file
                dump_one->>trexio: write_nucleus_num
                dump_one->>trexio: write_nucleus_charge
                dump_one->>trexio: write_nucleus_coord
                alt nelec_present
                    dump_one->>trexio: write_electron_num
                    alt spinpol_present
                        dump_one->>dump_one: compute n_up, n_dn
                        dump_one->>trexio: write_electron_up_num
                        dump_one->>trexio: write_electron_dn_num
                    end
                end
                trexio-->>dump_one: close HDF5 file
                dump_one-->>Caller: return None
            end
        end
    end
Loading

Sequence diagram for loading IOData from TREXIO HDF5 file

sequenceDiagram
    participant Caller
    participant load_one
    participant LineIterator
    participant trexio
    participant HDF5File

    Caller->>load_one: call load_one(lit)
    load_one->>trexio: _import_trexio()
    alt trexio_missing
        load_one-->>Caller: raise LoadError
    else trexio_available
        load_one->>LineIterator: read filename
        load_one->>trexio: File(filename, r, TREXIO_HDF5)
        trexio->>HDF5File: open existing file
        load_one->>trexio: read_nucleus_num
        trexio-->>load_one: n_nuc
        load_one->>trexio: read_nucleus_charge
        trexio-->>load_one: charges
        load_one->>trexio: read_nucleus_coord
        trexio-->>load_one: coords
        load_one->>load_one: convert to numpy arrays
        load_one->>trexio: read_electron_num
        alt electron_num_missing
            load_one->>load_one: nelec = None
        else electron_num_present
            trexio-->>load_one: nelec
        end
        load_one->>trexio: read_electron_up_num, read_electron_dn_num
        alt spinpol_missing
            load_one->>load_one: spinpol = None
        else spinpol_present
            trexio-->>load_one: n_up, n_dn
            load_one->>load_one: spinpol = n_up - n_dn
        end
        load_one->>load_one: validate charges and coords length
        alt inconsistent_nucleus_fields
            load_one-->>Caller: raise LoadError
        else consistent_nucleus_fields
            load_one->>load_one: atnums = round(charges)
            load_one->>load_one: build result dict
            alt nelec_present
                load_one->>load_one: compute charge from charges and nelec
            end
            load_one-->>Caller: result dict
        end
    end
Loading

Updated class diagram for TREXIO format integration

classDiagram
    class IOData {
        atcoords: ndarray
        atnums: ndarray
        nelec: int
        spinpol: int
    }

    class LineIterator {
        filename: str
        f: TextIO
    }

    class LoadError {
        message: str
        filename: str
    }

    class TrexioFormat {
        +PATTERNS: list
        +_import_trexio()
        +load_one(lit)
        +dump_one(f, data)
    }

    class trexio_File {
        +File(filename, mode, back_end)
        +read_nucleus_num(file)
        +read_nucleus_charge(file)
        +read_nucleus_coord(file)
        +read_electron_num(file)
        +read_electron_up_num(file)
        +read_electron_dn_num(file)
        +write_nucleus_num(file, n_nuc)
        +write_nucleus_charge(file, charges)
        +write_nucleus_coord(file, coords)
        +write_electron_num(file, nelec)
        +write_electron_up_num(file, n_up)
        +write_electron_dn_num(file, n_dn)
    }

    TrexioFormat ..> IOData : uses
    TrexioFormat ..> LineIterator : uses
    TrexioFormat ..> LoadError : raises
    TrexioFormat ..> trexio_File : delegates IO to

    class document_load_one {
        +document_load_one(fmt, required, optional)
    }

    class document_dump_one {
        +document_dump_one(fmt, required, optional)
    }

    TrexioFormat ..> document_load_one : decorator
    TrexioFormat ..> document_dump_one : decorator
Loading

File-Level Changes

Change Details Files
Introduce TREXIO reader that loads nuclei and electron data from HDF5-backed TREXIO files with validation and optional fields.
  • Add trexio-specific loader using LineIterator.filename and trexio.File in read mode with HDF5 backend.
  • Read nucleus count, charges, and coordinates; derive atomic numbers by rounding charges and validate array lengths against nucleus count.
  • Attempt to read total and spin-resolved electron counts, computing spin polarization when available, handling trexio errors gracefully.
  • Translate TrexIO and generic exceptions into LoadError with the file name for consistent error reporting.
  • Export loader metadata such as PATTERNS and decorate load_one with document_load_one for integration with IOData API.
iodata/formats/trexio.py
Introduce TREXIO writer that serializes IOData atomic and electron information into HDF5-backed TREXIO files.
  • Implement dump_one that requires atcoords/atnums, validates atom count consistency, and converts IOData fields to numpy arrays and ints.
  • Derive target filename from file object .name, enforcing that a real file handle is passed and raising RuntimeError otherwise.
  • Close the passed file handle, delete any existing file path to avoid conflicts, and create a new trexio.File in write mode with HDF5 backend.
  • Write nucleus count, charges, and coordinates, then optionally write electron count and spin-resolved up/down counts derived from nelec and spinpol.
  • Decorate dump_one with document_dump_one for consistent docs and format registration; reuse shared _import_trexio helper for optional dependency handling.
iodata/formats/trexio.py
Add subprocess-based integration test to verify TREXIO roundtrip read/write while avoiding trexio C-extension pytest conflicts.
  • Conditionally skip the test module at import time if trexio is not installed using importlib.util.find_spec.
  • Create a temporary Python script that builds a small IOData instance (H2-like system), dumps it to a .trexio file, reloads it, and asserts equality of coordinates, atomic numbers, nelec, and spinpol.
  • Run the helper script in a subprocess via subprocess.check_call with the project root injected into PYTHONPATH so that local iodata code is used.
  • Use tmp_path as working directory so files are isolated and cleaned up with the test fixture.
iodata/test/test_trexio.py
Wire trexio into project development dependencies for testing.
  • Add trexio to the dev extras list in pyproject.toml so that it is installed in development/test environments.
pyproject.toml

Possibly linked issues

  • #(unknown): The PR adds TrexIO load/dump support and tests, directly implementing the requested TrexIO support feature.

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@deepsource-io
Copy link

deepsource-io bot commented Dec 10, 2025

Here's the code health analysis summary for commits fa4d09d..0d3c7c3. View details on DeepSource ↗.

Analysis Summary

AnalyzerStatusSummaryLink
DeepSource Shell LogoShell✅ SuccessView Check ↗
DeepSource Python LogoPython✅ SuccessView Check ↗

💡 If you’re a repository administrator, you can configure the quality gates from the settings.

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes - here's some feedback:

  • In iodata/formats/trexio.py, the importlib import is unused and can be removed to keep the module minimal and avoid confusion about dynamic import usage.
  • The dump_one implementation closes the file object passed in and then deletes/recreates the file by name; this side effect on the caller’s file handle may be surprising—consider refactoring the API to accept a filename directly for TREXIO or at least clearly constrain dump_one to operate only on real file paths and not arbitrary file-like objects.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `iodata/formats/trexio.py`, the `importlib` import is unused and can be removed to keep the module minimal and avoid confusion about dynamic import usage.
- The `dump_one` implementation closes the file object passed in and then deletes/recreates the file by name; this side effect on the caller’s file handle may be surprising—consider refactoring the API to accept a filename directly for TREXIO or at least clearly constrain `dump_one` to operate only on real file paths and not arbitrary file-like objects.

## Individual Comments

### Comment 1
<location> `iodata/formats/trexio.py:154-157` </location>
<code_context>
+        if nelec is not None:
+            trexio.write_electron_num(tfile, nelec)
+            if spinpol is not None:
+                n_up = (nelec + spinpol) // 2
+                n_dn = (nelec - spinpol) // 2
+                trexio.write_electron_up_num(tfile, n_up)
+                trexio.write_electron_dn_num(tfile, n_dn)
\ No newline at end of file
</code_context>

<issue_to_address>
**issue (bug_risk):** Validate `(nelec, spinpol)` consistency before using integer division to compute `n_up`/`n_dn`.

Using `// 2` relies on both `(nelec + spinpol)` and `(nelec - spinpol)` being even; inconsistent inputs will silently floor and yield an invalid electron configuration in the TREXIO file. Please add a check (e.g. assert or explicit error) that both expressions are even before computing `n_up`/`n_dn`, and fail loudly if the parity is wrong.
</issue_to_address>

### Comment 2
<location> `iodata/test/test_trexio.py:62-63` </location>
<code_context>
+iodata_new = load_one(filename, fmt="trexio")
+
+print("Verifying data...")
+np.testing.assert_allclose(iodata_new.atcoords, atcoords, err_msg="atcoords mismatch")
+np.testing.assert_equal(iodata_new.atnums, atnums, err_msg="atnums mismatch")
+assert float(iodata_new.nelec) == nelec, f"nelec mismatch: {iodata_new.nelec} != {nelec}"
+assert int(iodata_new.spinpol) == spinpol, f"spinpol mismatch: {iodata_new.spinpol} != {spinpol}"
+
+print("Verification passed")
</code_context>

<issue_to_address>
**suggestion (testing):** Extend round-trip assertions to also check the derived charge and possibly tolerate float round-off.

Since `load_one` derives `charge` from the nuclear charges and `nelec`, it would be helpful to also assert on `iodata_new.charge` (e.g. that it is 0.0 here). Also, because `nelec` is a float and may go through conversions, consider using `np.isclose` (or a small tolerance) instead of strict equality to keep the test robust to harmless representation changes while still catching real regressions.

```suggestion
np.testing.assert_allclose(
    float(iodata_new.nelec),
    nelec,
    rtol=1.0e-8,
    atol=1.0e-12,
    err_msg=f"nelec mismatch: {iodata_new.nelec} != {nelec}",
)
np.testing.assert_allclose(
    float(iodata_new.charge),
    0.0,
    rtol=1.0e-8,
    atol=1.0e-12,
    err_msg=f"charge mismatch: {iodata_new.charge} != 0.0",
)
assert int(iodata_new.spinpol) == spinpol, f"spinpol mismatch: {iodata_new.spinpol} != {spinpol}"
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@dhruvDev23 dhruvDev23 changed the title Added the trexio support Added the trexio support (Solves Issue #384) Dec 10, 2025
@dhruvDev23
Copy link
Author

Hi @PaulWAyers , I have closed my previous PR and submitted this updated one now all checks are passing.
Kindly review this PR at your convenience. Thank you.

Copy link
Member

@tovrstra tovrstra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! I've got a few requests for changes.

Comment on lines +37 to +96
script = """
import numpy as np
import os
import sys
from iodata import IOData
from iodata.api import load_one, dump_one
atcoords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
atnums = np.array([1, 1])
nelec = 2.0
spinpol = 0
iodata_orig = IOData(atcoords=atcoords, atnums=atnums, nelec=nelec, spinpol=spinpol)
filename = "test.trexio"
if os.path.exists(filename):
os.remove(filename)
print(f"Dumping to {filename}")
dump_one(iodata_orig, filename, fmt="trexio")
print(f"Loading from {filename}")
iodata_new = load_one(filename, fmt="trexio")
print("Verifying data...")
np.testing.assert_allclose(iodata_new.atcoords, atcoords, err_msg="atcoords mismatch")
np.testing.assert_equal(iodata_new.atnums, atnums, err_msg="atnums mismatch")
np.testing.assert_allclose(
float(iodata_new.nelec),
nelec,
rtol=1.0e-8,
atol=1.0e-12,
err_msg=f"nelec mismatch: {iodata_new.nelec} != {nelec}",
)
np.testing.assert_allclose(
float(iodata_new.charge),
0.0,
rtol=1.0e-8,
atol=1.0e-12,
err_msg=f"charge mismatch: {iodata_new.charge} != 0.0",
)
assert int(iodata_new.spinpol) == spinpol, (
f"spinpol mismatch: {iodata_new.spinpol} != {spinpol}"
)
print("Verification passed")
"""
script_file = tmp_path / "verify_trexio_subprocess.py"
script_file.write_text(script, encoding="utf-8")

# Determine project root (assuming this test is in iodata/test/)
current_dir = os.path.dirname(os.path.abspath(__file__))
project_root = os.path.abspath(os.path.join(current_dir, "../.."))

# Add project root to PYTHONPATH to ensure local iodata code is used
env = os.environ.copy()
current_pythonpath = env.get("PYTHONPATH", "")
env["PYTHONPATH"] = f"{project_root}:{current_pythonpath}"

subprocess.check_call([sys.executable, str(script_file)], cwd=tmp_path, env=env)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This construction with the subprocess should be avoided. You can make the test conditional on the availability of the trexio package without a subprocess. Something along the following lines:

try:
    import trexio
except ImportError:
    trexio = None

@pytest.mark.skipif(trexio is None, reason="requires trexio")
def test_something():
    ...

"sphinx_autodoc_typehints",
"sphinx-copybutton",
"sympy",

Copy link
Member

@tovrstra tovrstra Dec 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid unintended changes due to edits that were later undone. (E.g. by checking the diffs.)

In order to test the trexio code, it would be useful to include a set of extra optional dependencies like this:

extra = [
    "trexio",
]

With this, you can update the following line in the CI workflow:

run: pip install -e .[dev]

E.g. change it to:

        run: pip install -e .[dev,extra]

@tovrstra tovrstra changed the title Added the trexio support (Solves Issue #384) Added the trexio support Dec 13, 2025
@tovrstra
Copy link
Member

I've looked a bit into how trexio works, because I was surprised to see HDF5, which has known data integrity issues (low probability, but likely enough if used at the exa-scale, which the format seems to aim for.) This is a bit strange. It's not really our concern, so fine by me to add some support for this format in IOData, but not something I would generally recommend if one has a choice. A Zarr (or even simple NPZ) backend would make more sense.

Another concern is that trexio may not be a good format to support in IOData through the load_one and dump_one functions. Our API is designed for text file formats that can be implemented in IOData, i.e. without handing it over to another library. One can argue that IOData's API can be generalized to support different scenario's, but that would then have to be refactored first.

@dhruvDev23
Copy link
Author

dhruvDev23 commented Dec 16, 2025

Hi @tovrstra , Thanks for the detailed context, it helped clarify the concerns.

I understand that IODATA’s API mainly works well with text file formats. Supporting a binary, library-backed format like TREXIO goes beyond the original design. I agree that creating a more general solution would probably need a larger API update, which is not part of this PR.

I attempted to apply the requested configuration changes:

  • Defined the trexio optional dependency in pyproject.toml.
  • Updated .github/workflows/pytest.yaml to install it (.[dev,trexio]).

However, enabling it in the CI Workflow caused several checks to fail. hence, i reverted the "Fix CI Workflow" commit for now.

Let me know how you'd like to proceed!

@tovrstra
Copy link
Member

It seems that you have rebased on main (not sure) in a way that Git thinks that commits from main are new in this branch. I'm going to prune your branch to avoid this redundancy.

For keeping up with changes in main, it is simpler to just merge the main branch back into a feature branch. Rebasing is sometimes nice, but can have quite a few unintended side effects.

@tovrstra
Copy link
Member

My apologies for force-pushing. There seemed to be no other way to remove unrelated commits from the PR.

@tovrstra
Copy link
Member

@dhruvDev23 Can you provide instructions on how you ran the tests locally, including how you installed trexio? I've tried with pip install trexio and this seems to install a version without HDF5 support. Before I try further I'd like to understand how you use this library. It's a little strange that the package on PyPI is dysfunctional. (It wants to use HDF5 by default but then crashes because it is not included.)

@PaulWAyers Can you comment on how much we should care about TrexIO? At this stage, my judgement is that it is too much of a mess to support. Before making any decision on how to proceed, I want to get a more accurate assessment and have it work locally.

@dhruvDev23
Copy link
Author

Hi @tovrstra ,

Thanks for cleaning up my branch. I have synced my local setup as well.

My Environment

  • OS: macos (Apple Silicon)
  • Python: 3.10

Installation of Trexio

I have installed dependencies using 'pip install -e .[dev,trexio]', which pulled trexio==2.6.0 from pyPi. with this Setup, Trexio works locally for me without requiring a separate system HDF5 installation.

Running Tests

I have used the command pytest iodata/test/test_trexio.py.

Given the inconsistent behavior of the PyPI package across platforms, I am happy to follow your lead on whether it makes sense to pause this integration or wait until the upstream packaging situation is clearer.

@PaulWAyers
Copy link
Member

I think that TrexIO is probably about the most likely of the (too) many formats for interoperable quantum chemistry data to survive, based on the prominence of the groups supporting it and their expertise/resources/committment.

Admittedly, there is no good solution for FAIR quantum chemistry data yet. I know that David Sherrill is organizing a meeting at Georgia Tech this Spring (supported by the US NSF) to discuss these things. One could decide to defer until that sorts itself out but I have doubt it will be fast (or effective) given what a charlie foxtrot it turned out to be when MolSSI (which had more resources than anyone) attempted.

So probably the real answer is that it is going to be clunky for a while (eventually AI is probably going to be the solution; converting information between formats is very AI-able) but if @dhruvDev23 is committed to doing it and maintaining it, I think it's a good feature to have as it ties us into one of (maybe the most prominent) quantum chemistry ecosystem that's committed to FAIR data/software principles.

@tovrstra
Copy link
Member

Thanks for commenting. I'll try to get it to work locally to get a better idea on how to proceed. Eventually, we should support it given the community interest.

One thing to consider is an independent implementation that simply builds of h5py instead of the TrexIO library, making it trivially cross-platform.

@dhruvDev23
Copy link
Author

Thanks both of you for the feedback.

As you mentioned, a TREXIO implementation has merit given the community interest, even if it currently has some shortcomings. I am also open to the idea of an h5py based implementation as an alternative to a direct dependency on trexio, particularly if that helps with cross-platform stability.

At this point, I am comfortable waiting while @tovrstra tries things out locally and following whatever direction you feel best fits the project. Once there’s a more clearly defined direction for the TREXIO support and its intended scope, I’d be more than happy to help with the implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support TrexIO

3 participants