-
Notifications
You must be signed in to change notification settings - Fork 2
Description
What happened?
I am testing PIXL for use in Manchester University NHS Foundation Trust.
Where I work has a requirement that we can only query our image database (PACS) using Study Instance UIDs, and not accession number or Mrn.
I prepared some test data in an (separate) orthanc service, and a csv file to use with pixl populate, along with a project config, etc.
I left the "accession_number" and "mrn" columns empty in the csv as we do not want to use these, and not including the information here seemed like the best way to ensure that.
Currently no images are being exported - I am still to investigate why.
I have come across an issue I believe to be a bug.
On the third run of pixl populate /path/to/csv, I had an error at cli/src/pixl_cli/_database.py::_filter_exported_messages() line 102, whereby the messages_df.merge(...) command was failing due to non-unique index.
On inspection the data in images_df was already duplicated (populated from the first 2 runs).
The cause of multiple data entry originates at cli/src/pixl_cli/_database.py::_filter_existing_images().
Looking at messages_df_reindexed in pdb, nan values are used for "accession_number" and "mrn" columns.
The same columns in images_df_reindexed contain the string "NaN".
This difference in values results in a false match where a true one would have been expected, thus flagging existing images as non-existent and resulting in another write to the database of the same images.
This discrepancy appears to be a feature of SQLAlchemy, happening during cli/src/pixl_cli/_database.py::_add_images_to_session()'s call to session.bulk_save_objects(images).
Dropping a breakpoint after that call and running session.query(Image).all() (from a fresh database, so all is what we just put in), shows images containing nan being stored to the database and retrieved as "NaN".
Presumably, this is due to the type definition of the Image.accession_number and Image.mrn being Mapped[str] and SQLAlchemy casting the nan to a valid type instead of raising an exception - I have not looked further into this, so this may be intended behaviour from SQLAlchemy, though failing to read just written data back as-was is worrisome.
I instead looked at storing the accession_number and mrn as the correct type (str) at the point of reading, in cli/src/pixl_cli/_io.py::_load_csv().
The following call reads in the csv and results in nan entries under any cell left blank in the csv: messages_df = pd.read_csv(filepath, header=0, dtype=str).
As an aside, nan, being a float, seems like an odd placeholder for empty given the dtype=str statement in the code above.
I figure that swapping all nan for "" before running any subsequent conversions makes sense, as an empty cell being considered equivalent to "" is what I would, perhaps naively, expect.
Specifically, I can confirm the following diff results in the images entering the database once only no matter how many calls to pixl populate /path/to/csv with no values given for accession_number and mrn.
diff --git a/cli/src/pixl_cli/_io.py b/cli/src/pixl_cli/_io.py
index 0eb0db3..48e9431 100644
--- a/cli/src/pixl_cli/_io.py
+++ b/cli/src/pixl_cli/_io.py
@@ -94,10 +94,11 @@ def _load_csv(filepath: Path) -> pd.DataFrame:
messages_df = pd.read_csv(filepath, header=0, dtype=str)
messages_df = _map_columns(messages_df, MAP_CSV_TO_MESSAGE_KEYS)
_raise_if_column_names_not_found(messages_df, [col.name for col in DF_COLUMNS])
+ messages_df = messages_df.replace(np.nan, "")
messages_df["series_uid"] = (
- messages_df.get("series_uid", pd.Series("")).replace(np.nan, "").str.strip()
+ messages_df.get("series_uid", pd.Series("")).str.strip()
)
- messages_df["pseudo_patient_id"] = messages_df["pseudo_patient_id"].replace(np.nan, None)
+ messages_df["pseudo_patient_id"] = messages_df["pseudo_patient_id"].replace("", None)
# Parse non string columns
messages_df["procedure_occurrence_id"] = messages_df["procedure_occurrence_id"].astype(int)I can make a PR, but first I wanted to ask if this is a good solution, or will it end up breaking some assumptions made elsewhere in the code base?
I also note that I have not looked at the code path for loading from a parquet file. Perhaps a similar consideration is needed there.
NB/ I have also tried setting blank fields to None, but this does not match the database schema that expects accession_number to be non-null.
Relevant log output (optional)
No response
What did you expect? (optional)
No response