Refactor read file function to have constant output behavior #1210

bw4sz · 2025-11-14T17:31:15Z

To solidify the API in other parts of the codebase, we need a clear understanding of the DeepForest data model. I've written my idea in the docs:

The DeepForest data model

The DeepForest data model has three components

Annotations are stored as dataframes. Each row is an annotation with a single geometry and label. Each annotation dataframe must contain an 'image_path' and a 'label' column. The image_path is relative to the .root_dir attribute of the dataframe.
Annotation geometry is stored as a shapely object, allowing easy conversion among Point, Polygon and Box representations as well as geometric operations.
Annotations are expressed in image coordinates, not geographic coordinates.

Once we agree on the data model, all combinations of input data passed through read_file should arrive at this representation. This isn't currently true; there are lots of edge cases. This PR ties those cases to the data model.

It involves one very small breaking change. The optional image_path argument used to be the relative path; now it's the full path, so that the root_dir can be parsed from it.

read_file(input, image_path)

The alternative is to have no breaking change, in which case users will need to specify both a root_dir and an image_path argument, even though in most cases that means doing something silly:

full_image_path = "<full path to the image>"
read_file(input, image_path=os.path.basename(full_image_path), root_dir=os.path.dirname(full_image_path))

I'm comfortable with this change. It's a very rarely used edge case that only arises when you have an in-memory dataset or a shapefile that lacks an image_path column. We can wait until 2.1 to push this.
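To make the trade-off above concrete, here is a minimal sketch of the split a user would otherwise do by hand when only a full path is available. The helper name `split_image_path` is hypothetical and is not part of DeepForest; it only shows how a root_dir and a relative image_path could be derived from one full path.

```python
import os


def split_image_path(full_image_path):
    """Split a full image path into (root_dir, relative image_path).

    Hypothetical helper for illustration only, not a DeepForest function.
    """
    root_dir = os.path.dirname(full_image_path)
    image_name = os.path.basename(full_image_path)
    return root_dir, image_name


# The two values a caller would pass as root_dir and image_path.
root_dir, image_name = split_image_path("/path/to/OSBS_029.tif")
```

Parsing the root_dir inside read_file would hide this step from the user; keeping the two arguments separate makes the call more verbose but explicit.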

ethanwhite

Looks really good. A few things to fix related to rgb_path vs image_path and some suggestions on code clarity. I tried to clean out the comments that got dropped when you pushed changes Friday afternoon, but if something looks weird it's possible there's one still hiding somewhere that I haven't found.


bw4sz · 2025-11-17T23:37:47Z

I have reset and pushed a single commit.

This PR makes the read_file function more consistent. In all cases a user should end up with:

A geodataframe with a geometry column
An image_path column (with one edge case below)
A label column
A root_dir attribute.

The function better partitions cases and makes it much clearer what's happening.

The one edge case is that an in-memory dataset cannot have an image_path column, since there is no path to the image.

The breaking change is that in many cases

read_file(annotations_csv)

would have silently passed even if the file 1) didn't have an image_path column, or 2) had an image_path column but no root_dir, leaving no way of knowing where those images were stored.

The default for root_dir is the directory containing the annotations file on disk.
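A minimal sketch of that default, assuming an explicit root_dir always wins and the fallback is the folder that holds the annotations file. The function name `resolve_root_dir` is hypothetical and only illustrates the rule; it is not the code in this PR.

```python
import os


def resolve_root_dir(annotations_file, root_dir=None):
    """Return the explicit root_dir if given, else the annotations file's folder.

    Hypothetical helper for illustration only, not a DeepForest function.
    """
    if root_dir is not None:
        return root_dir
    # abspath makes the fallback independent of how the file path was written.
    return os.path.dirname(os.path.abspath(annotations_file))
```

With this rule a relative path such as `annotations.csv` resolves to the current working directory only when the file itself lives there.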

bw4sz · 2025-11-17T23:57:34Z

@jveitchmichaelis I left the default as the same dir as the annotation file, instead of the current working dir. It was simpler and cleaner internally.

bw4sz · 2025-11-17T23:59:02Z

The only thing we might want to address is whether we should make the output a bespoke class

class deepforest_dataframe(gpd.GeoDataFrame):

that asserts the required structures and maybe could address the annoying filtering issue for the root_dir #1042
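One possible shape for such a class, sketched with plain pandas so it runs without geopandas. The class name `AnnotationFrame` is hypothetical, and this is not the implementation in this PR. It relies on the pandas subclassing hooks `_metadata` and `_constructor`; the assumption is that the same mechanism carries over to a `gpd.GeoDataFrame` subclass, since GeoDataFrame is itself a pandas DataFrame subclass.

```python
import pandas as pd


class AnnotationFrame(pd.DataFrame):
    """DataFrame subclass that carries a root_dir attribute through filtering.

    Illustrative only; a real class would subclass gpd.GeoDataFrame.
    """

    # Attribute names listed here are copied onto results of pandas operations.
    _metadata = ["root_dir"]

    @property
    def _constructor(self):
        # Make pandas operations return AnnotationFrame, not a plain DataFrame.
        return AnnotationFrame


df = AnnotationFrame(
    {"image_path": ["a.tif", "b.tif"], "label": ["Tree", "Tree"]}
)
df.root_dir = "/data/images"

# Boolean filtering returns a new object; root_dir should survive it.
subset = df[df["image_path"] == "a.tif"]
```

The required-structure assertions mentioned above (image_path, label, geometry) could live in the constructor of such a class, but that part is left out of this sketch.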

ethanwhite

Very nice work. I haven't read the tests yet because I have to run, but wanted to get the rest of this to you. Lots of in-line comments plus:

The label check function returns the data frame while the image_path check returns an image path. I found this difference in the behavior of the two checks confusing. The "check" naming is also a bit confusing, since to my mind a check would just check something, not set something.

Related to this it looks like __check_image_path__ only returns a single value in the case where the optional argument isn't passed and the column is present. This seems weird since there can be multiple values in the annotation file. Might be misunderstanding something in my haste, so hope this is more helpful than not.
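As a rough illustration of a check that only checks, and that handles annotation files pointing at several images, here is a minimal sketch. The function name `find_missing_images` is hypothetical and this is not the code under review; it takes the image_path values and a root_dir and reports which files are absent, without returning or setting a path.

```python
import os


def find_missing_images(image_paths, root_dir):
    """Return the sorted unique image paths that do not exist under root_dir.

    Hypothetical helper for illustration only, not a DeepForest function.
    """
    missing = []
    # De-duplicate first: many annotations usually share one image.
    for name in sorted(set(image_paths)):
        if not os.path.exists(os.path.join(root_dir, name)):
            missing.append(name)
    return missing
```

A caller could raise FileNotFoundError when the returned list is non-empty and name both the image_path argument and the annotations file as possible sources of the problem.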


ethanwhite · 2025-11-18T01:38:57Z

docs/user_guide/01_Reading_data.md

+gdf = utilities.read_file(
+    input="/path/to/annotations.shp",
+    image_path="/path/to/OSBS_029.tif",   # required if no image_path column
+    label="Tree"                        # optional: used if no 'label' column in the shapefile
+)


Instead of the comments here I'd probably explain what the arguments do in the text above, which would go along with the clarification of single image/label. E.g.,

In cases where there is a single image file, the optional image_path argument can be used in place of providing an image_path column to specify the image file location. Likewise, if there is only a single label, the optional label argument can be used in place of providing a label column to indicate the single label for all objects.


ethanwhite · 2025-11-18T12:34:46Z

src/deepforest/utilities.py

+    if not os.path.exists(full_image_path):
+        raise FileNotFoundError(
+            f"Image file {full_image_path} not found, please check the image_path argument, it should be the full path: read_file(input=df, image_path='/path/to/image.tif', ...)"
+        )


Same comment as above - also note that the issue might be in the annotations file


ethanwhite

Finished reading the tests. They look good, but I think we should add a test where the annotations come from multiple image files. That will help us make sure both now and in the future that __check_image_path__ works properly with multi-image datasets.


bw4sz · 2025-11-18T20:12:47Z

I went back to the behavior of the current source: the image_path is relative and the user has to provide the basename. Better to be verbose and explicit. I made a custom class to maintain the root_dir attribute and updated the docs. I still need to look again to confirm everything looks right; it passes locally.

bw4sz · 2025-12-01T19:52:15Z

I added back the deep copy of any incoming pandas or geopandas dataframe. Otherwise I believe this is ready.

bw4sz changed the title ~~Refactor read file function to have constant output behavior~~ [WIP] Refactor read file function to have constant output behavior Nov 14, 2025

bw4sz mentioned this pull request Nov 14, 2025

Remove width and height from plot_results and plot_annotations #1198

Open

ethanwhite requested changes Nov 17, 2025


bw4sz force-pushed the refactor_read_file branch from 55ab89e to 81bd472 Compare November 17, 2025 23:33

bw4sz mentioned this pull request Nov 17, 2025

.root_dir attribute isn't maintained when filtering results objects #1042

Open

bw4sz force-pushed the refactor_read_file branch from 81bd472 to 7f544f8 Compare November 17, 2025 23:56

bw4sz changed the title ~~[WIP] Refactor read file function to have constant output behavior~~ Refactor read file function to have constant output behavior Nov 17, 2025

Refactor read_file to be more readable, consistent and fail early

3c2c6cd

bw4sz force-pushed the refactor_read_file branch from 7f544f8 to 3c2c6cd Compare November 18, 2025 00:00

ethanwhite requested changes Nov 18, 2025


ethanwhite reviewed Nov 18, 2025


create custom class to close #1042, refactored to allow multiple image files to pass through

ac7e93f

ethanwhite mentioned this pull request Nov 19, 2025

Numpy < 2 dependency is blocking installation of other, non-deepforest dependencies #1201

Closed

Copilot AI mentioned this pull request Nov 20, 2025

[WIP] Add minimal refactor for canonical annotation model and API henrykironde/DeepForest#13

Draft

8 tasks

Add back in deep copy

a636e29
