-
Notifications
You must be signed in to change notification settings - Fork 51
Added the trexio support #393
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer's GuideAdds TREXIO (.trexio) file format support using the HDF5 backend, including load/dump helpers and a subprocess-based roundtrip test, and wires trexio into the development dependencies. Sequence diagram for dumping IOData to TREXIO HDF5 filesequenceDiagram
participant Caller
participant dump_one
participant IOData
participant OS
participant trexio
participant HDF5File
Caller->>dump_one: call dump_one(f, data)
dump_one->>trexio: _import_trexio()
alt trexio_missing
dump_one-->>Caller: raise RuntimeError
else trexio_available
dump_one->>IOData: read atcoords, atnums, nelec, spinpol
dump_one->>dump_one: validate atcoords and atnums
alt invalid_atoms
dump_one-->>Caller: raise RuntimeError
else valid_atoms
dump_one->>dump_one: get filename from f.name
alt missing_name
dump_one-->>Caller: raise RuntimeError
else has_name
dump_one->>Caller: f.close()
dump_one->>OS: check and remove existing file
dump_one->>trexio: File(filename, w, TREXIO_HDF5)
trexio->>HDF5File: create and open file
dump_one->>trexio: write_nucleus_num
dump_one->>trexio: write_nucleus_charge
dump_one->>trexio: write_nucleus_coord
alt nelec_present
dump_one->>trexio: write_electron_num
alt spinpol_present
dump_one->>dump_one: compute n_up, n_dn
dump_one->>trexio: write_electron_up_num
dump_one->>trexio: write_electron_dn_num
end
end
trexio-->>dump_one: close HDF5 file
dump_one-->>Caller: return None
end
end
end
Sequence diagram for loading IOData from TREXIO HDF5 filesequenceDiagram
participant Caller
participant load_one
participant LineIterator
participant trexio
participant HDF5File
Caller->>load_one: call load_one(lit)
load_one->>trexio: _import_trexio()
alt trexio_missing
load_one-->>Caller: raise LoadError
else trexio_available
load_one->>LineIterator: read filename
load_one->>trexio: File(filename, r, TREXIO_HDF5)
trexio->>HDF5File: open existing file
load_one->>trexio: read_nucleus_num
trexio-->>load_one: n_nuc
load_one->>trexio: read_nucleus_charge
trexio-->>load_one: charges
load_one->>trexio: read_nucleus_coord
trexio-->>load_one: coords
load_one->>load_one: convert to numpy arrays
load_one->>trexio: read_electron_num
alt electron_num_missing
load_one->>load_one: nelec = None
else electron_num_present
trexio-->>load_one: nelec
end
load_one->>trexio: read_electron_up_num, read_electron_dn_num
alt spinpol_missing
load_one->>load_one: spinpol = None
else spinpol_present
trexio-->>load_one: n_up, n_dn
load_one->>load_one: spinpol = n_up - n_dn
end
load_one->>load_one: validate charges and coords length
alt inconsistent_nucleus_fields
load_one-->>Caller: raise LoadError
else consistent_nucleus_fields
load_one->>load_one: atnums = round(charges)
load_one->>load_one: build result dict
alt nelec_present
load_one->>load_one: compute charge from charges and nelec
end
load_one-->>Caller: result dict
end
end
Updated class diagram for TREXIO format integrationclassDiagram
class IOData {
atcoords: ndarray
atnums: ndarray
nelec: int
spinpol: int
}
class LineIterator {
filename: str
f: TextIO
}
class LoadError {
message: str
filename: str
}
class TrexioFormat {
+PATTERNS: list
+_import_trexio()
+load_one(lit)
+dump_one(f, data)
}
class trexio_File {
+File(filename, mode, back_end)
+read_nucleus_num(file)
+read_nucleus_charge(file)
+read_nucleus_coord(file)
+read_electron_num(file)
+read_electron_up_num(file)
+read_electron_dn_num(file)
+write_nucleus_num(file, n_nuc)
+write_nucleus_charge(file, charges)
+write_nucleus_coord(file, coords)
+write_electron_num(file, nelec)
+write_electron_up_num(file, n_up)
+write_electron_dn_num(file, n_dn)
}
TrexioFormat ..> IOData : uses
TrexioFormat ..> LineIterator : uses
TrexioFormat ..> LoadError : raises
TrexioFormat ..> trexio_File : delegates IO to
class document_load_one {
+document_load_one(fmt, required, optional)
}
class document_dump_one {
+document_dump_one(fmt, required, optional)
}
TrexioFormat ..> document_load_one : decorator
TrexioFormat ..> document_dump_one : decorator
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
|
Here's the code health analysis summary for commits Analysis Summary
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- In
iodata/formats/trexio.py, theimportlibimport is unused and can be removed to keep the module minimal and avoid confusion about dynamic import usage. - The
dump_oneimplementation closes the file object passed in and then deletes/recreates the file by name; this side effect on the caller’s file handle may be surprising—consider refactoring the API to accept a filename directly for TREXIO or at least clearly constraindump_oneto operate only on real file paths and not arbitrary file-like objects.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `iodata/formats/trexio.py`, the `importlib` import is unused and can be removed to keep the module minimal and avoid confusion about dynamic import usage.
- The `dump_one` implementation closes the file object passed in and then deletes/recreates the file by name; this side effect on the caller’s file handle may be surprising—consider refactoring the API to accept a filename directly for TREXIO or at least clearly constrain `dump_one` to operate only on real file paths and not arbitrary file-like objects.
## Individual Comments
### Comment 1
<location> `iodata/formats/trexio.py:154-157` </location>
<code_context>
+ if nelec is not None:
+ trexio.write_electron_num(tfile, nelec)
+ if spinpol is not None:
+ n_up = (nelec + spinpol) // 2
+ n_dn = (nelec - spinpol) // 2
+ trexio.write_electron_up_num(tfile, n_up)
+ trexio.write_electron_dn_num(tfile, n_dn)
\ No newline at end of file
</code_context>
<issue_to_address>
**issue (bug_risk):** Validate `(nelec, spinpol)` consistency before using integer division to compute `n_up`/`n_dn`.
Using `// 2` relies on both `(nelec + spinpol)` and `(nelec - spinpol)` being even; inconsistent inputs will silently floor and yield an invalid electron configuration in the TREXIO file. Please add a check (e.g. assert or explicit error) that both expressions are even before computing `n_up`/`n_dn`, and fail loudly if the parity is wrong.
</issue_to_address>
### Comment 2
<location> `iodata/test/test_trexio.py:62-63` </location>
<code_context>
+iodata_new = load_one(filename, fmt="trexio")
+
+print("Verifying data...")
+np.testing.assert_allclose(iodata_new.atcoords, atcoords, err_msg="atcoords mismatch")
+np.testing.assert_equal(iodata_new.atnums, atnums, err_msg="atnums mismatch")
+assert float(iodata_new.nelec) == nelec, f"nelec mismatch: {iodata_new.nelec} != {nelec}"
+assert int(iodata_new.spinpol) == spinpol, f"spinpol mismatch: {iodata_new.spinpol} != {spinpol}"
+
+print("Verification passed")
</code_context>
<issue_to_address>
**suggestion (testing):** Extend round-trip assertions to also check the derived charge and possibly tolerate float round-off.
Since `load_one` derives `charge` from the nuclear charges and `nelec`, it would be helpful to also assert on `iodata_new.charge` (e.g. that it is 0.0 here). Also, because `nelec` is a float and may go through conversions, consider using `np.isclose` (or a small tolerance) instead of strict equality to keep the test robust to harmless representation changes while still catching real regressions.
```suggestion
np.testing.assert_allclose(
float(iodata_new.nelec),
nelec,
rtol=1.0e-8,
atol=1.0e-12,
err_msg=f"nelec mismatch: {iodata_new.nelec} != {nelec}",
)
np.testing.assert_allclose(
float(iodata_new.charge),
0.0,
rtol=1.0e-8,
atol=1.0e-12,
err_msg=f"charge mismatch: {iodata_new.charge} != 0.0",
)
assert int(iodata_new.spinpol) == spinpol, f"spinpol mismatch: {iodata_new.spinpol} != {spinpol}"
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
|
Hi @PaulWAyers , I have closed my previous PR and submitted this updated one now all checks are passing. |
tovrstra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution! I've got a few requests for changes.
| script = """ | ||
| import numpy as np | ||
| import os | ||
| import sys | ||
| from iodata import IOData | ||
| from iodata.api import load_one, dump_one | ||
| atcoords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]) | ||
| atnums = np.array([1, 1]) | ||
| nelec = 2.0 | ||
| spinpol = 0 | ||
| iodata_orig = IOData(atcoords=atcoords, atnums=atnums, nelec=nelec, spinpol=spinpol) | ||
| filename = "test.trexio" | ||
| if os.path.exists(filename): | ||
| os.remove(filename) | ||
| print(f"Dumping to {filename}") | ||
| dump_one(iodata_orig, filename, fmt="trexio") | ||
| print(f"Loading from {filename}") | ||
| iodata_new = load_one(filename, fmt="trexio") | ||
| print("Verifying data...") | ||
| np.testing.assert_allclose(iodata_new.atcoords, atcoords, err_msg="atcoords mismatch") | ||
| np.testing.assert_equal(iodata_new.atnums, atnums, err_msg="atnums mismatch") | ||
| np.testing.assert_allclose( | ||
| float(iodata_new.nelec), | ||
| nelec, | ||
| rtol=1.0e-8, | ||
| atol=1.0e-12, | ||
| err_msg=f"nelec mismatch: {iodata_new.nelec} != {nelec}", | ||
| ) | ||
| np.testing.assert_allclose( | ||
| float(iodata_new.charge), | ||
| 0.0, | ||
| rtol=1.0e-8, | ||
| atol=1.0e-12, | ||
| err_msg=f"charge mismatch: {iodata_new.charge} != 0.0", | ||
| ) | ||
| assert int(iodata_new.spinpol) == spinpol, ( | ||
| f"spinpol mismatch: {iodata_new.spinpol} != {spinpol}" | ||
| ) | ||
| print("Verification passed") | ||
| """ | ||
| script_file = tmp_path / "verify_trexio_subprocess.py" | ||
| script_file.write_text(script, encoding="utf-8") | ||
|
|
||
| # Determine project root (assuming this test is in iodata/test/) | ||
| current_dir = os.path.dirname(os.path.abspath(__file__)) | ||
| project_root = os.path.abspath(os.path.join(current_dir, "../..")) | ||
|
|
||
| # Add project root to PYTHONPATH to ensure local iodata code is used | ||
| env = os.environ.copy() | ||
| current_pythonpath = env.get("PYTHONPATH", "") | ||
| env["PYTHONPATH"] = f"{project_root}:{current_pythonpath}" | ||
|
|
||
| subprocess.check_call([sys.executable, str(script_file)], cwd=tmp_path, env=env) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This construction with the subprocess should be avoided. You can make the test conditional on the availability of the trexio package without a subprocess. Something along the following lines:
try:
import trexio
except ImportError:
trexio = None
@pytest.mark.skipif(trexio is None, reason="requires trexio")
def test_something():
...| "sphinx_autodoc_typehints", | ||
| "sphinx-copybutton", | ||
| "sympy", | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid unintended changes due to edits that were later undone. (E.g. by checking the diffs.)
In order to test the trexio code, it would be useful to include a set of extra optional dependencies like this:
extra = [
"trexio",
]With this, you can update the following line in the CI workflow:
iodata/.github/workflows/pytest.yaml
Line 48 in fa4d09d
| run: pip install -e .[dev] |
E.g. change it to:
run: pip install -e .[dev,extra]
|
I've looked a bit into how trexio works, because I was surprised to see HDF5, which has known data integrity issues (low probability, but likely enough if used at the exa-scale, which the format seems to aim for.) This is a bit strange. It's not really our concern, so fine by me to add some support for this format in IOData, but not something I would generally recommend if one has a choice. A Zarr (or even simple NPZ) backend would make more sense. Another concern is that trexio may not be a good format to support in IOData through the load_one and dump_one functions. Our API is designed for text file formats that can be implemented in IOData, i.e. without handing it over to another library. One can argue that IOData's API can be generalized to support different scenario's, but that would then have to be refactored first. |
|
Hi @tovrstra , Thanks for the detailed context, it helped clarify the concerns. I understand that IODATA’s API mainly works well with text file formats. Supporting a binary, library-backed format like TREXIO goes beyond the original design. I agree that creating a more general solution would probably need a larger API update, which is not part of this PR. I attempted to apply the requested configuration changes:
However, enabling it in the CI Workflow caused several checks to fail. hence, i reverted the "Fix CI Workflow" commit for now. Let me know how you'd like to proceed! |
|
It seems that you have rebased on main (not sure) in a way that Git thinks that commits from main are new in this branch. I'm going to prune your branch to avoid this redundancy. For keeping up with changes in main, it is simpler to just merge the main branch back into a feature branch. Rebasing is sometimes nice, but can have quite a few unintended side effects. |
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
This reverts commit 64b7c21.
fbd5171 to
f64257e
Compare
|
My apologies for force-pushing. There seemed to be no other way to remove unrelated commits from the PR. |
|
@dhruvDev23 Can you provide instructions on how you ran the tests locally, including how you installed trexio? I've tried with @PaulWAyers Can you comment on how much we should care about TrexIO? At this stage, my judgement is that it is too much of a mess to support. Before making any decision on how to proceed, I want to get a more accurate assessment and have it work locally. |
|
Hi @tovrstra , Thanks for cleaning up my branch. I have synced my local setup as well. My Environment
Installation of TrexioI have installed dependencies using 'pip install -e .[dev,trexio]', which pulled Running TestsI have used the command Given the inconsistent behavior of the PyPI package across platforms, I am happy to follow your lead on whether it makes sense to pause this integration or wait until the upstream packaging situation is clearer. |
|
I think that TrexIO is probably about the most likely of the (too) many formats for interoperable quantum chemistry data to survive, based on the prominence of the groups supporting it and their expertise/resources/committment. Admittedly, there is no good solution for FAIR quantum chemistry data yet. I know that David Sherrill is organizing a meeting at Georgia Tech this Spring (supported by the US NSF) to discuss these things. One could decide to defer until that sorts itself out but I have doubt it will be fast (or effective) given what a charlie foxtrot it turned out to be when MolSSI (which had more resources than anyone) attempted. So probably the real answer is that it is going to be clunky for a while (eventually AI is probably going to be the solution; converting information between formats is very AI-able) but if @dhruvDev23 is committed to doing it and maintaining it, I think it's a good feature to have as it ties us into one of (maybe the most prominent) quantum chemistry ecosystem that's committed to FAIR data/software principles. |
|
Thanks for commenting. I'll try to get it to work locally to get a better idea on how to proceed. Eventually, we should support it given the community interest. One thing to consider is an independent implementation that simply builds of h5py instead of the TrexIO library, making it trivially cross-platform. |
|
Thanks both of you for the feedback. As you mentioned, a TREXIO implementation has merit given the community interest, even if it currently has some shortcomings. I am also open to the idea of an At this point, I am comfortable waiting while @tovrstra tries things out locally and following whatever direction you feel best fits the project. Once there’s a more clearly defined direction for the TREXIO support and its intended scope, I’d be more than happy to help with the implementation. |
Fixes #384
(@tovrstra added the first line to this message and remove this info from the title to be in line with common practices on GitHub.)
This PR adds the read and write support for the Trexio file format using the HDF5 backend.
Implementation
New Feature
Testing
Summary by Sourcery
Add support for reading and writing TREXIO files using the HDF5 backend and verify round-trip consistency.
New Features:
Build:
Tests: