Commit 81bd472

Refactor read_file to be more readable, consistent and fail early

1 parent: 6c2f987

File tree: 5 files changed, +346 −247 lines

`docs/user_guide/01_Reading_data.md` — 49 additions, 27 deletions

````diff
@@ -1,30 +1,44 @@
 # Reading in Data
 
-The most time-consuming part of many open-source projects is getting the data in and out. This is because there are so many formats and ways a user might interact with the package. DeepForest has collated many use cases into a single `read_file` function that will attempt to read many common data formats, both projected and unprojected, and create a dataframe ready for DeepForest functions.
+The most time-consuming part of many open-source projects is getting the data in and out. This is because there are so many formats and ways a user might interact with the package.
 
-You can also optionally provide:
-- `image_path`: A single image path to assign to all annotations in the input. This is useful when the input contains annotations for only one image.
-- `label`: A single label to apply to all rows. This is helpful when all annotations share the same label (e.g., "Tree").
+## The DeepForest data model
+
+The DeepForest data model has three components:
+
+1. Annotations are stored as dataframes. Each row is an annotation with a single geometry and label. Each annotation dataframe must contain an 'image_path' column, which is the relative, not full, path to the image, and a 'label' column.
+2. Annotation geometry is stored as a shapely object, allowing easy movement among Point, Polygon and Box representations.
+3. Annotations are expressed in image coordinates, not geographic coordinates. There are utilities to convert geospatial data (.shp, .gpkg) to DeepForest data formats.
+
+## The read_file function
+DeepForest has collated many use cases into a single `read_file` function that will read many common data formats, both projected and unprojected, and create a dataframe that fits the DeepForest data model and is ready for DeepForest functions.
+
+### Example 1: A csv file containing box annotations
 
-Example:
 ```
 from deepforest import utilities
 
-df = utilities.read_file("annotations.csv", image_path="OSBS_029.tif", label="Tree")
+df = utilities.read_file("annotations.csv", image_path="<full path to the image>", label="Tree")
 ```
 
-**Note:** If your input file contains multiple image filenames and you do not provide the `image_path` argument, a warning may appear:
+For files that lack an `image_path` or `label` column, pass the `image_path` or `label` argument.
 
+```python
+from deepforest import utilities
+
+gdf = utilities.read_file(
+    input="/path/to/annotations.shp",
+    image_path="/path/to/OSBS_029.tif",  # required if no image_path column
+    label="Tree"  # optional: used if no 'label' column in the shapefile
+)
 ```
-UserWarning: Multiple image filenames found. This may cause issues if the file paths are not correctly specified.
-```
-To avoid this, consider providing a single `image_path` argument if all annotations belong to the same image.
 
 At a high level, `read_file` will:
 
 1. Check the file extension to determine the format.
-2. Read the file into a pandas dataframe.
-3. Append the location of the image directory as an attribute.
+2. Read and convert the file into a GeoPandas dataframe.
+3. Append the location of the image directory as a 'root_dir' attribute.
+4. If the input data is a geospatial object, such as a shapefile, convert geographic coordinates to image coordinates based on the coordinate reference system (CRS) and resolution of the image.
 
 Allows for the following formats:
 
@@ -34,21 +48,6 @@ Allows for the following formats:
 - COCO (`.json`)
 - Pascal VOC (`.xml`)
 
-## Annotation Geometries and Coordinate Systems
-
-DeepForest was originally designed for bounding box annotations. As of DeepForest 1.4.0, point and polygon annotations are also supported. There are two ways to format annotations, depending on the annotation platform you are using. `read_file` can read points, polygons, and boxes, in both image coordinate systems (relative to image origin at top-left 0,0) as well as projected coordinates on the Earth's surface. The `read_file` method also appends the location of the current image directory as an attribute. To access this attribute use the `root_dir` attribute.
-
-```python
-from deepforest import get_data
-from deepforest import utilities
-
-filename = get_data("OSBS_029.csv")
-df = utilities.read_file(filename)
-df.root_dir
-```
-
-**Note:** For CSV files, coordinates are expected to be in the image coordinate system, not projected coordinates (such as latitude/longitude or UTM).
-
 ### Boxes
 
 #### CSV
@@ -140,6 +139,29 @@ shp = utilities.read_file(input="/path/to/boxes_shapefile.shp")
 shp.head()
 ```
 
+If your shapefile does not include an `image_path` column, you must provide the raster path via `image_path`:
+
+```python
+from deepforest import utilities
+
+shp = utilities.read_file(
+    input="/path/to/boxes_shapefile.shp",
+    image_path="/path/to/OSBS_029.tif"
+)
+```
+
+If your shapefile also lacks a `label` column, you can assign one for all rows:
+
+```python
+from deepforest import utilities
+
+shp = utilities.read_file(
+    input="/path/to/boxes_shapefile.shp",
+    image_path="/path/to/OSBS_029.tif",
+    label="Tree"
+)
+```
+
 Example output:
 
 ```
````
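The three high-level steps the docs list for `read_file` (check the extension, read the file, attach the image directory as a `root_dir` attribute) can be sketched in plain Python. This is not DeepForest's implementation: `SimpleAnnotations` and `sketch_read_file` are hypothetical names, the sketch handles only CSV, and the geospatial-to-image coordinate conversion step is omitted.

```python
import csv
import os


class SimpleAnnotations(list):
    """A list of annotation row dicts that can carry a root_dir attribute
    (a stand-in for the GeoPandas dataframe the real function returns)."""
    root_dir = None


def sketch_read_file(path):
    # 1. Check the file extension to determine the format.
    ext = os.path.splitext(path)[1].lower()
    if ext != ".csv":
        # Fail early on formats this sketch does not handle.
        raise ValueError(f"Unsupported format for this sketch: {ext}")
    # 2. Read the file into a row container.
    with open(path, newline="") as f:
        rows = SimpleAnnotations(csv.DictReader(f))
    # 3. Append the image directory location as a 'root_dir' attribute.
    rows.root_dir = os.path.dirname(os.path.abspath(path))
    return rows
```

Attaching `root_dir` to the returned object mirrors why downstream code in this commit can read `ground_df.root_dir` instead of recomputing the directory from the csv path.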

`src/deepforest/main.py` — 9 additions, 9 deletions

```diff
@@ -446,7 +446,6 @@ def predict_image(self, image: np.ndarray | None = None, path: str | None = None
         result["label"] = result.label.apply(lambda x: self.numeric_to_label_dict[x])
 
         if path is None:
-            result = utilities.read_file(result)
             warnings.warn(
                 "An image was passed directly to predict_image, the result.root_dir attribute "
                 "will be None in the output dataframe, to use visualize.plot_results, "
@@ -621,7 +620,7 @@ def predict_tile(
             root_dir = os.path.dirname(paths[0])
         else:
             print(
-                "No image path provided, root_dir will be None, since either "
+                "No image path provided, root_dir of the output results dataframe will be None, since either "
                 "images were directly provided or there were multiple image paths"
             )
             root_dir = None
@@ -643,7 +642,8 @@
         else:
             cropmodel_results = mosaic_results
 
-        formatted_results = utilities.read_file(cropmodel_results, root_dir=root_dir)
+        formatted_results = utilities.__pandas_to_geodataframe__(cropmodel_results)
+        formatted_results.root_dir = root_dir
 
         return formatted_results
 
@@ -906,7 +906,7 @@ def predict_batch(self, images, preprocess_fn=None):
                 continue
             geom_type = utilities.determine_geometry_type(pred)
             result = utilities.format_geometry(pred, geom_type=geom_type)
-            results.append(utilities.read_file(result))
+            results.append(result)
 
         return results
 
@@ -997,16 +997,16 @@ def evaluate(
             dict: Results dictionary containing precision, recall and other metrics
         """
         self.model.eval()
-        ground_df = utilities.read_file(csv_file)
-        ground_df["label"] = ground_df.label.apply(lambda x: self.label_dict[x])
-
         if root_dir is None:
-            root_dir = os.path.dirname(csv_file)
+            root_dir = self.config.validation.root_dir
+
+        ground_df = utilities.read_file(csv_file, root_dir=root_dir)
+        ground_df["label"] = ground_df.label.apply(lambda x: self.label_dict[x])
 
         if predictions is None:
             # Get the predict dataloader and use predict_batch
             predictions = self.predict_file(
-                csv_file, root_dir, size=size, batch_size=batch_size
+                csv_file, ground_df.root_dir, size=size, batch_size=batch_size
             )
 
         if iou_threshold is None:
```

(Indentation in the hunks above is reconstructed; the scraped page had stripped it.)
