Skip to content

Conversation

@bw4sz
Copy link
Collaborator

@bw4sz bw4sz commented Nov 14, 2025

To solidify the API in other parts of the codebase, we need a clear understanding of the DeepForest data model. I've written my idea in the docs:

The DeepForest data model

The DeepForest data model has three components

  1. Annotations are stored as dataframes. Each row is an annotation with a single geometry and label. Each annotation dataframe must contain a 'image_path' and a 'label' column. The image_path is relative to the .root_dir attribute of the dataframe.
  2. Annotation geometry is stored as a shapely object, allowing the easy movement among Point, Polygon and Box representations and geometric operations.
  3. Annotations are expressed in image coordinates, not geographic coordinates.

Once we agree on the data model, all combinations of input data taken from read_file should arrive at this representation. This isn't currently true, with lots of edge cases. This PR ties those cases to the data model.

It involves one, very small, breaking change. The optional image_path argument was the relative path, now its the full path, so that the root_dir can be parsed from it.

read_file(input, image_path)

The alternative is not to have any breaking, and users will need specify both a root_dir and a image_path argument, even though in most cases users will need to do something silly

full_image_path = <full path to the image>
read_file(input, image_path = os.path.basename(full_image_path), root_dir = os.path.dirname(full_image_path)

I'm comfortable with this change, its a very rarely used edge case, in situations in which you have an in-memory dataset, or shapefile that lacked an image_path column. We can wait until 2.1 to push this.

@bw4sz bw4sz changed the title Refactor read file function to have constant output behavior [WIP] Refactor read file function to have constant output behavior Nov 14, 2025
Copy link
Member

@ethanwhite ethanwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good. A few things to fix related to rgb_path vs image_path and some suggestions on code clarity. I tried to clean out the comments that got dropped when you pushed changes Friday afternoon, but if something looks weird it's possible there's one still hiding somewhere that I haven't found.

@bw4sz bw4sz force-pushed the refactor_read_file branch from 55ab89e to 81bd472 Compare November 17, 2025 23:33
@bw4sz
Copy link
Collaborator Author

bw4sz commented Nov 17, 2025

I have reset and pushed a single commit.

This PR makes the read_file function more consistent. In all cases a user should end up with

  • A geodataframe with a geometry column
  • An image_path column (with one edge case below)
  • A label column
  • A root_dir attribute.

The function better partitions cases and makes it much clearer what's happenning.

The one edge case is that an in-memory dataset cannot have an image_path column, since there is no path to the image.

The breaking change is that in many cases

read_file(annotations_csv)

would have been silently allowed to pass even if it 1) didn't have an image_path column, or 2) had an image path column, but no root_dir, so no way of knowing where those images were stored.

The default for root_dir is relative to the annotations_file on disk.

@bw4sz bw4sz force-pushed the refactor_read_file branch from 81bd472 to 7f544f8 Compare November 17, 2025 23:56
@bw4sz bw4sz changed the title [WIP] Refactor read file function to have constant output behavior Refactor read file function to have constant output behavior Nov 17, 2025
@bw4sz
Copy link
Collaborator Author

bw4sz commented Nov 17, 2025

@jveitchmichaelis I left the default as the same dir as the annotation file, instead of the current working dir. It was simpler and cleaner internally.

@bw4sz
Copy link
Collaborator Author

bw4sz commented Nov 17, 2025

The only thing we might want to address is whether we should make the output a bespoke class

class deepforest_dataframe(gpd.GeoDataFrame):

that asserts the required structures and maybe could address the annoying filtering issue for the root_dir #1042

@bw4sz bw4sz force-pushed the refactor_read_file branch from 7f544f8 to 3c2c6cd Compare November 18, 2025 00:00
Copy link
Member

@ethanwhite ethanwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work. I haven't read the tests yet because I have to run, but wanted to get the rest of this to you. Lots of in-line comments plus:

The label check function returns the data frame while the image_path check returns an image path. I found this difference in the behavior of the two checks confusing. The "check" naming is also a bit confusing since to my mind it would just check something not set something.

Related to this it looks like __check_image_path__ only returns a single value in the case where the optional argument isn't passed and the column is present. This seems weird since there can be multiple values in the annotation file. Might be misunderstanding something in my haste, so hope this is more helpful than not.

Comment on lines 27 to 31
gdf = utilities.read_file(
input="/path/to/annotations.shp",
image_path="/path/to/OSBS_029.tif", # required if no image_path column
label="Tree" # optional: used if no 'label' column in the shapefile
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of the comments here I'd probably explain what the arguments due in the text above, which would go along with the clarification of single image/label. E.g.,

In cases where there is a single image file, the optional image_path argument can be used in place of providing an image_path column to specify the image file location. Likewise, if there is only a single label, the optional label argument can be used in place of providing a label column to indicate the single label for all objects.

Comment on lines 211 to 214
if not os.path.exists(full_image_path):
raise FileNotFoundError(
f"Image file {full_image_path} not found, please check the image_path argument, it should be the full path: read_file(input=df, image_path='/path/to/image.tif', ...)"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above - also note that the issue might be in the annotations file

Copy link
Member

@ethanwhite ethanwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished reading the tests. They look good, but I think we should add a test where the annotations come from multiple image files. That will help us make sure both now and in the future that __check_image_path__ works properly with multi-image datasets.

@bw4sz
Copy link
Collaborator Author

bw4sz commented Nov 18, 2025

I went back to behavior of current source, the image_path is relative, and the user has to provide basename, better to be verbose and explicit. I made a custom class to maintain the root_dir attribute, update the docs (still need to look again if everything looks right, passes locally).

@bw4sz
Copy link
Collaborator Author

bw4sz commented Dec 1, 2025

I added back the deep copy of any incoming pandas or geopandas dataframe. Otherwise I believe this is ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants