-
Notifications
You must be signed in to change notification settings - Fork 224
Refactor read file function to have constant output behavior #1210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
ethanwhite
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good. A few things to fix related to rgb_path vs image_path and some suggestions on code clarity. I tried to clean out the comments that got dropped when you pushed changes Friday afternoon, but if something looks weird it's possible there's one still hiding somewhere that I haven't found.
55ab89e to
81bd472
Compare
|
I have reset and pushed a single commit. This PR makes the read_file function more consistent. In all cases a user should end up with
The function better partitions cases and makes it much clearer what's happenning. The one edge case is that an in-memory dataset cannot have an image_path column, since there is no path to the image. The breaking change is that in many cases would have been silently allowed to pass even if it 1) didn't have an image_path column, or 2) had an image path column, but no root_dir, so no way of knowing where those images were stored. The default for root_dir is relative to the annotations_file on disk. |
81bd472 to
7f544f8
Compare
|
@jveitchmichaelis I left the default as the same dir as the annotation file, instead of the current working dir. It was simpler and cleaner internally. |
|
The only thing we might want to address is whether we should make the output a bespoke class that asserts the required structures and maybe could address the annoying filtering issue for the root_dir #1042 |
7f544f8 to
3c2c6cd
Compare
ethanwhite
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice work. I haven't read the tests yet because I have to run, but wanted to get the rest of this to you. Lots of in-line comments plus:
The label check function returns the data frame while the image_path check returns an image path. I found this difference in the behavior of the two checks confusing. The "check" naming is also a bit confusing since to my mind it would just check something not set something.
Related to this it looks like __check_image_path__ only returns a single value in the case where the optional argument isn't passed and the column is present. This seems weird since there can be multiple values in the annotation file. Might be misunderstanding something in my haste, so hope this is more helpful than not.
| gdf = utilities.read_file( | ||
| input="/path/to/annotations.shp", | ||
| image_path="/path/to/OSBS_029.tif", # required if no image_path column | ||
| label="Tree" # optional: used if no 'label' column in the shapefile | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of the comments here I'd probably explain what the arguments due in the text above, which would go along with the clarification of single image/label. E.g.,
In cases where there is a single image file, the optional
image_pathargument can be used in place of providing animage_pathcolumn to specify the image file location. Likewise, if there is only a single label, the optionallabelargument can be used in place of providing alabelcolumn to indicate the single label for all objects.
src/deepforest/utilities.py
Outdated
| if not os.path.exists(full_image_path): | ||
| raise FileNotFoundError( | ||
| f"Image file {full_image_path} not found, please check the image_path argument, it should be the full path: read_file(input=df, image_path='/path/to/image.tif', ...)" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above - also note that the issue might be in the annotations file
ethanwhite
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finished reading the tests. They look good, but I think we should add a test where the annotations come from multiple image files. That will help us make sure both now and in the future that __check_image_path__ works properly with multi-image datasets.
…e files to pass through
|
I went back to behavior of current source, the image_path is relative, and the user has to provide basename, better to be verbose and explicit. I made a custom class to maintain the root_dir attribute, update the docs (still need to look again if everything looks right, passes locally). |
|
I added back the deep copy of any incoming pandas or geopandas dataframe. Otherwise I believe this is ready. |
To solidify the API in other parts of the codebase, we need a clear understanding of the DeepForest data model. I've written my idea in the docs:
The DeepForest data model
The DeepForest data model has three components
Once we agree on the data model, all combinations of input data taken from read_file should arrive at this representation. This isn't currently true, with lots of edge cases. This PR ties those cases to the data model.
It involves one, very small, breaking change. The optional image_path argument was the relative path, now its the full path, so that the root_dir can be parsed from it.
The alternative is not to have any breaking, and users will need specify both a root_dir and a image_path argument, even though in most cases users will need to do something silly
I'm comfortable with this change, its a very rarely used edge case, in situations in which you have an in-memory dataset, or shapefile that lacked an image_path column. We can wait until 2.1 to push this.