Issues handling np.nan in data read from csv.

### What happened?

I am testing PIXL for use in Manchester University NHS Foundation Trust.
Where I work has a requirement that we can only query our image database (PACS) using Study Instance UIDs, and not accession number or Mrn.

I prepared some test data in an (separate) orthanc service, and a csv file to use with `pixl populate`, along with a project config, etc.
I left the "accession_number" and "mrn" columns empty in the csv as we do not want to use these, and not including the information here seemed like the best way to ensure that.

Currently no images are being exported - I am still to investigate why.

I have come across an issue I believe to be a bug.

On the third run of `pixl populate /path/to/csv`, I had an error at cli/src/pixl_cli/_database.py::_filter_exported_messages() line 102, whereby the `messages_df.merge(...)` command was failing due to non-unique index.
On inspection the data in images_df was already duplicated (populated from the first 2 runs).

The cause of multiple data entry originates at cli/src/pixl_cli/_database.py::_filter_existing_images().
Looking at messages_df_reindexed in pdb, `nan` values are used for "accession_number" and "mrn" columns.
The same columns in images_df_reindexed contain the string `"NaN"`.
This difference in values results in a false match where a true one would have been expected, thus flagging existing images as non-existent and resulting in another write to the database of the same images.

This discrepancy appears to be a feature of SQLAlchemy, happening during cli/src/pixl_cli/_database.py::_add_images_to_session()'s call to `session.bulk_save_objects(images)`.
Dropping a breakpoint after that call and running `session.query(Image).all()` (from a fresh database, so all is what we just put in), shows images containing `nan` being stored to the database and retrieved as `"NaN"`.
Presumably, this is due to the type definition of the Image.accession_number and Image.mrn being `Mapped[str]` and SQLAlchemy casting the `nan` to a valid type instead of raising an exception - I have not looked further into this, so this may be intended behaviour from SQLAlchemy, though failing to read just written data back as-was is worrisome.

I instead looked at storing the accession_number and mrn as the correct type (str) at the point of reading, in cli/src/pixl_cli/_io.py::_load_csv().
The following call reads in the csv and results in `nan` entries under any cell left blank in the csv: `messages_df = pd.read_csv(filepath, header=0, dtype=str)`.
As an aside, `nan`, being a float, seems like an odd placeholder for *empty* given the `dtype=str` statement in the code above.
I figure that swapping all `nan` for `""` before running any subsequent conversions makes sense, as an empty cell being considered equivalent to "" is what I would, perhaps naively, expect.

Specifically, I can confirm the following diff results in the images entering the database once only no matter how many calls to `pixl populate /path/to/csv` with no values given for accession_number and mrn.
```diff
diff --git a/cli/src/pixl_cli/_io.py b/cli/src/pixl_cli/_io.py
index 0eb0db3..48e9431 100644
--- a/cli/src/pixl_cli/_io.py
+++ b/cli/src/pixl_cli/_io.py
@@ -94,10 +94,11 @@ def _load_csv(filepath: Path) -> pd.DataFrame:
     messages_df = pd.read_csv(filepath, header=0, dtype=str)
     messages_df = _map_columns(messages_df, MAP_CSV_TO_MESSAGE_KEYS)
     _raise_if_column_names_not_found(messages_df, [col.name for col in DF_COLUMNS])
+    messages_df = messages_df.replace(np.nan, "")
     messages_df["series_uid"] = (
-        messages_df.get("series_uid", pd.Series("")).replace(np.nan, "").str.strip()
+        messages_df.get("series_uid", pd.Series("")).str.strip()
     )
-    messages_df["pseudo_patient_id"] = messages_df["pseudo_patient_id"].replace(np.nan, None)
+    messages_df["pseudo_patient_id"] = messages_df["pseudo_patient_id"].replace("", None)

     # Parse non string columns
     messages_df["procedure_occurrence_id"] = messages_df["procedure_occurrence_id"].astype(int)
```

I can make a PR, but first I wanted to ask if this is a good solution, or will it end up breaking some assumptions made elsewhere in the code base?
I also note that I have not looked at the code path for loading from a parquet file. Perhaps a similar consideration is needed there.

NB/ I have also tried setting blank fields to `None`, but this does not match the database schema that expects accession_number to be non-null.


### Relevant log output (optional)

_No response_

### What did you expect? (optional)

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Issues handling np.nan in data read from csv. #615

What happened?

Relevant log output (optional)

What did you expect? (optional)

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issues handling np.nan in data read from csv. #615

Description

What happened?

Relevant log output (optional)

What did you expect? (optional)

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions