76 changes: 49 additions & 27 deletions docs/user_guide/01_Reading_data.md
@@ -1,30 +1,44 @@
# Reading in Data

The most time-consuming part of many open-source projects is getting the data in and out. This is because there are so many formats and ways a user might interact with the package. DeepForest has collated many use cases into a single `read_file` function that will attempt to read many common data formats, both projected and unprojected, and create a dataframe ready for DeepForest functions.
## The DeepForest data model

You can also optionally provide:
- `image_path`: A single image path to assign to all annotations in the input. This is useful when the input contains annotations for only one image.
- `label`: A single label to apply to all rows. This is helpful when all annotations share the same label (e.g., "Tree").
The DeepForest data model has four components:

1. Annotations are stored as dataframes. Each row is an annotation with a single geometry and label. Each annotation dataframe must contain an 'image_path' column, which holds the basename of the image rather than its full path, and a 'label' column.
2. Annotation geometry is stored as a shapely object, which makes it easy to convert among Point, Polygon, and Box representations.
3. Annotations are expressed in image coordinates, not geographic coordinates. There are utilities to convert geospatial data (.shp, .gpkg) to DeepForest data formats.
4. A root_dir attribute that specifies where the images are stored. A Dee

## The read_file function
DeepForest has collated many use cases into a single `read_file` function that reads many common data formats, both projected and unprojected, and creates a dataframe that fits the DeepForest data model and is ready for DeepForest functions.

### Example 1: A csv file containing box annotations.

Example:
```
from deepforest import utilities

df = utilities.read_file("annotations.csv", image_path="OSBS_029.tif", label="Tree")
df = utilities.read_file("annotations.csv", root_dir="<directory containing images>", image_path="<relative path to the image>", label="Tree")
```

**Note:** If your input file contains multiple image filenames and you do not provide the `image_path` argument, a warning may appear:
For files that lack an `image_path` or `label` column, pass the `image_path` or `label` argument. This applies the same image_path and label for the entire file, and is not appropriate for multi-image files.

```python
from deepforest import utilities

gdf = utilities.read_file(
    input="/path/to/annotations.shp",
    image_path="OSBS_029.tif",  # required if no image_path column
    root_dir="path/to/images/",  # required if the image_path argument is used
    label="Tree",  # optional: used if no 'label' column in the shapefile
)
```
UserWarning: Multiple image filenames found. This may cause issues if the file paths are not correctly specified.
```
To avoid this, consider providing a single `image_path` argument if all annotations belong to the same image.

At a high level, `read_file` will:

1. Check the file extension to determine the format.
2. Read the file into a pandas dataframe.
3. Append the location of the image directory as an attribute.
2. Read and convert the file into a GeoPandas dataframe.
3. Append the location of the image directory as a 'root_dir' attribute.
4. If input data is a geospatial object, such as a shapefile, convert geographic coordinates to image coordinates based on the coordinate reference system (CRS) and resolution of the image.
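
As a rough illustration of steps 1 and 4, the sketch below dispatches on the file extension and flags geospatial inputs for coordinate conversion. It is not DeepForest's implementation: the mapping and the names `FORMAT_BY_EXTENSION`, `guess_annotation_format`, and `needs_coordinate_conversion` are hypothetical, and only the extensions mentioned in this guide are included.

```python
from pathlib import Path

# Hypothetical mapping for illustration only; limited to extensions
# named in this guide. This is not DeepForest's internal table.
FORMAT_BY_EXTENSION = {
    ".csv": "csv",
    ".shp": "geospatial",
    ".gpkg": "geospatial",
    ".json": "coco",
    ".xml": "pascal_voc",
}


def guess_annotation_format(path):
    """Return a format name for `path` based on its file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in FORMAT_BY_EXTENSION:
        raise ValueError(f"Unsupported annotation file extension: {suffix!r}")
    return FORMAT_BY_EXTENSION[suffix]


def needs_coordinate_conversion(path):
    """Geospatial files hold geographic coordinates that must be converted
    to image coordinates; the other formats are assumed to already be in
    image coordinates."""
    return guess_annotation_format(path) == "geospatial"
```

Checking the extension first keeps the format-specific parsing separate from the shared steps, such as attaching `root_dir`, that apply to every input.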

Allows for the following formats:

@@ -34,21 +48,6 @@ Allows for the following formats:
- COCO (`.json`)
- Pascal VOC (`.xml`)

## Annotation Geometries and Coordinate Systems

DeepForest was originally designed for bounding box annotations. As of DeepForest 1.4.0, point and polygon annotations are also supported. There are two ways to format annotations, depending on the annotation platform you are using. `read_file` can read points, polygons, and boxes, in both image coordinate systems (relative to image origin at top-left 0,0) as well as projected coordinates on the Earth's surface. The `read_file` method also appends the location of the current image directory as an attribute. To access this attribute use the `root_dir` attribute.

```python
from deepforest import get_data
from deepforest import utilities

filename = get_data("OSBS_029.csv")
df = utilities.read_file(filename)
df.root_dir
```

**Note:** For CSV files, coordinates are expected to be in the image coordinate system, not projected coordinates (such as latitude/longitude or UTM).

### Boxes

#### CSV
@@ -140,6 +139,29 @@ shp = utilities.read_file(input="/path/to/boxes_shapefile.shp")
shp.head()
```

If your shapefile does not include an `image_path` column, you must provide the raster path via the `image_path` argument:

```python
from deepforest import utilities

shp = utilities.read_file(
    input="/path/to/boxes_shapefile.shp",
    image_path="/path/to/OSBS_029.tif"
)
```

If your shapefile also lacks a `label` column, you can assign one for all rows:

```python
from deepforest import utilities

shp = utilities.read_file(
    input="/path/to/boxes_shapefile.shp",
    image_path="/path/to/OSBS_029.tif",
    label="Tree"
)
```

Example output:

```
24 changes: 14 additions & 10 deletions src/deepforest/main.py
@@ -446,7 +446,6 @@ def predict_image(self, image: np.ndarray | None = None, path: str | None = None
result["label"] = result.label.apply(lambda x: self.numeric_to_label_dict[x])

if path is None:
result = utilities.read_file(result)
warnings.warn(
"An image was passed directly to predict_image, the result.root_dir attribute "
"will be None in the output dataframe, to use visualize.plot_results, "
@@ -621,7 +620,7 @@ def predict_tile(
root_dir = os.path.dirname(paths[0])
else:
print(
"No image path provided, root_dir will be None, since either "
"No image path provided, root_dir of the output results dataframe will be None, since either "
"images were directly provided or there were multiple image paths"
)
root_dir = None
@@ -643,7 +642,8 @@ def predict_tile(
else:
cropmodel_results = mosaic_results

formatted_results = utilities.read_file(cropmodel_results, root_dir=root_dir)
formatted_results = utilities.__pandas_to_geodataframe__(cropmodel_results)
formatted_results.root_dir = root_dir

return formatted_results

@@ -906,7 +906,7 @@ def predict_batch(self, images, preprocess_fn=None):
continue
geom_type = utilities.determine_geometry_type(pred)
result = utilities.format_geometry(pred, geom_type=geom_type)
results.append(utilities.read_file(result))
results.append(result)

return results

@@ -997,16 +997,16 @@ def evaluate(
dict: Results dictionary containing precision, recall and other metrics
"""
self.model.eval()
ground_df = utilities.read_file(csv_file)
ground_df["label"] = ground_df.label.apply(lambda x: self.label_dict[x])

if root_dir is None:
root_dir = os.path.dirname(csv_file)
root_dir = self.config.validation.root_dir

ground_df = utilities.read_file(csv_file, root_dir=root_dir)
ground_df["label"] = ground_df.label.apply(lambda x: self.label_dict[x])

if predictions is None:
# Get the predict dataloader and use predict_batch
predictions = self.predict_file(
csv_file, root_dir, size=size, batch_size=batch_size
csv_file, ground_df.root_dir, size=size, batch_size=batch_size
)

if iou_threshold is None:
@@ -1031,7 +1031,11 @@ def __evaluation_logs__(self, results):
"""Log metrics from evaluation results."""
# Log metrics
for key, value in results.items():
if type(value) in [pd.DataFrame, gpd.GeoDataFrame]:
if type(value) in [
pd.DataFrame,
gpd.GeoDataFrame,
utilities.DeepForest_DataFrame,
]:
pass
elif value is None:
pass