Commit 81bd472

Refactor read_file to be more readable, consistent and fail early

1 parent: 6c2f987

File tree: 5 files changed, +346 −247 lines

`docs/user_guide/01_Reading_data.md` — 49 additions, 27 deletions

````diff
@@ -1,30 +1,44 @@
 # Reading in Data
 
-The most time-consuming part of many open-source projects is getting the data in and out. This is because there are so many formats and ways a user might interact with the package. DeepForest has collated many use cases into a single `read_file` function that will attempt to read many common data formats, both projected and unprojected, and create a dataframe ready for DeepForest functions.
+The most time-consuming part of many open-source projects is getting the data in and out. This is because there are so many formats and ways a user might interact with the package.
 
-You can also optionally provide:
-- `image_path`: A single image path to assign to all annotations in the input. This is useful when the input contains annotations for only one image.
-- `label`: A single label to apply to all rows. This is helpful when all annotations share the same label (e.g., "Tree").
+## The DeepForest data model
+
+The DeepForest data model has three components:
+
+1. Annotations are stored as dataframes. Each row is an annotation with a single geometry and label. Each annotation dataframe must contain an 'image_path' column, which is the relative, not full, path to the image, and a 'label' column.
+2. Annotation geometry is stored as a shapely object, allowing easy movement among Point, Polygon and Box representations.
+3. Annotations are expressed in image coordinates, not geographic coordinates. There are utilities to convert geospatial data (.shp, .gpkg) to DeepForest data formats.
+
+## The read_file function
+DeepForest has collated many use cases into a single `read_file` function that will read many common data formats, both projected and unprojected, and create a dataframe that fits the DeepForest data model and is ready for DeepForest functions.
+
+### Example 1: A csv file containing box annotations
 
-Example:
 ```
 from deepforest import utilities
 
-df = utilities.read_file("annotations.csv", image_path="OSBS_029.tif", label="Tree")
+df = utilities.read_file("annotations.csv", image_path="<full path to the image>", label="Tree")
 ```
 
-**Note:** If your input file contains multiple image filenames and you do not provide the `image_path` argument, a warning may appear:
+For files that lack an `image_path` or `label` column, pass the `image_path` or `label` argument.
 
+```python
+from deepforest import utilities
+
+gdf = utilities.read_file(
+    input="/path/to/annotations.shp",
+    image_path="/path/to/OSBS_029.tif",  # required if no image_path column
+    label="Tree"  # optional: used if no 'label' column in the shapefile
+)
 ```
-UserWarning: Multiple image filenames found. This may cause issues if the file paths are not correctly specified.
-```
-To avoid this, consider providing a single `image_path` argument if all annotations belong to the same image.
 
 At a high level, `read_file` will:
 
 1. Check the file extension to determine the format.
-2. Read the file into a pandas dataframe.
-3. Append the location of the image directory as an attribute.
+2. Read and convert the file into a GeoPandas dataframe.
+3. Append the location of the image directory as a 'root_dir' attribute.
+4. If the input data is a geospatial object, such as a shapefile, convert geographic coordinates to image coordinates based on the coordinate reference system (CRS) and resolution of the image.
 
 Allows for the following formats:
 
@@ -34,21 +48,6 @@ Allows for the following formats:
 - COCO (`.json`)
 - Pascal VOC (`.xml`)
 
-## Annotation Geometries and Coordinate Systems
-
-DeepForest was originally designed for bounding box annotations. As of DeepForest 1.4.0, point and polygon annotations are also supported. There are two ways to format annotations, depending on the annotation platform you are using. `read_file` can read points, polygons, and boxes, in both image coordinate systems (relative to image origin at top-left 0,0) as well as projected coordinates on the Earth's surface. The `read_file` method also appends the location of the current image directory as an attribute. To access this attribute use the `root_dir` attribute.
-
-```python
-from deepforest import get_data
-from deepforest import utilities
-
-filename = get_data("OSBS_029.csv")
-df = utilities.read_file(filename)
-df.root_dir
-```
-
-**Note:** For CSV files, coordinates are expected to be in the image coordinate system, not projected coordinates (such as latitude/longitude or UTM).
-
 ### Boxes
 
 #### CSV
@@ -140,6 +139,29 @@ shp = utilities.read_file(input="/path/to/boxes_shapefile.shp")
 shp.head()
 ```
 
+If your shapefile does not include an `image_path` column, you must provide the raster path via `image_path`:
+
+```python
+from deepforest import utilities
+
+shp = utilities.read_file(
+    input="/path/to/boxes_shapefile.shp",
+    image_path="/path/to/OSBS_029.tif"
+)
+```
+
+If your shapefile also lacks a `label` column, you can assign one for all rows:
+
+```python
+from deepforest import utilities
+
+shp = utilities.read_file(
+    input="/path/to/boxes_shapefile.shp",
+    image_path="/path/to/OSBS_029.tif",
+    label="Tree"
+)
+```
+
 Example output:
 
 ```
````
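The three high-level steps the docs list for `read_file` (check the extension, read the file, attach the image directory as a `root_dir` attribute) can be sketched in plain Python. This is not DeepForest's implementation: `SimpleAnnotations` and `sketch_read_file` are hypothetical names, the sketch handles only CSV, and the geospatial-to-image coordinate conversion step is omitted.

```python
import csv
import os


class SimpleAnnotations(list):
    """A list of annotation row dicts that can carry a root_dir attribute
    (a stand-in for the GeoPandas dataframe the real function returns)."""
    root_dir = None


def sketch_read_file(path):
    # 1. Check the file extension to determine the format.
    ext = os.path.splitext(path)[1].lower()
    if ext != ".csv":
        # Fail early on formats this sketch does not handle.
        raise ValueError(f"Unsupported format for this sketch: {ext}")
    # 2. Read the file into a row container.
    with open(path, newline="") as f:
        rows = SimpleAnnotations(csv.DictReader(f))
    # 3. Append the image directory location as a 'root_dir' attribute.
    rows.root_dir = os.path.dirname(os.path.abspath(path))
    return rows
```

Attaching `root_dir` to the returned object mirrors why downstream code in this commit can read `ground_df.root_dir` instead of recomputing the directory from the csv path.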

`src/deepforest/main.py` — 9 additions, 9 deletions

```diff
@@ -446,7 +446,6 @@ def predict_image(self, image: np.ndarray | None = None, path: str | None = None
         result["label"] = result.label.apply(lambda x: self.numeric_to_label_dict[x])
 
         if path is None:
-            result = utilities.read_file(result)
             warnings.warn(
                 "An image was passed directly to predict_image, the result.root_dir attribute "
                 "will be None in the output dataframe, to use visualize.plot_results, "
@@ -621,7 +620,7 @@ def predict_tile(
             root_dir = os.path.dirname(paths[0])
         else:
             print(
-                "No image path provided, root_dir will be None, since either "
+                "No image path provided, root_dir of the output results dataframe will be None, since either "
                 "images were directly provided or there were multiple image paths"
             )
             root_dir = None
@@ -643,7 +642,8 @@
         else:
             cropmodel_results = mosaic_results
 
-        formatted_results = utilities.read_file(cropmodel_results, root_dir=root_dir)
+        formatted_results = utilities.__pandas_to_geodataframe__(cropmodel_results)
+        formatted_results.root_dir = root_dir
 
         return formatted_results
 
@@ -906,7 +906,7 @@ def predict_batch(self, images, preprocess_fn=None):
                 continue
             geom_type = utilities.determine_geometry_type(pred)
             result = utilities.format_geometry(pred, geom_type=geom_type)
-            results.append(utilities.read_file(result))
+            results.append(result)
 
         return results
 
@@ -997,16 +997,16 @@ def evaluate(
             dict: Results dictionary containing precision, recall and other metrics
         """
         self.model.eval()
-        ground_df = utilities.read_file(csv_file)
-        ground_df["label"] = ground_df.label.apply(lambda x: self.label_dict[x])
-
         if root_dir is None:
-            root_dir = os.path.dirname(csv_file)
+            root_dir = self.config.validation.root_dir
+
+        ground_df = utilities.read_file(csv_file, root_dir=root_dir)
+        ground_df["label"] = ground_df.label.apply(lambda x: self.label_dict[x])
 
         if predictions is None:
             # Get the predict dataloader and use predict_batch
             predictions = self.predict_file(
-                csv_file, root_dir, size=size, batch_size=batch_size
+                csv_file, ground_df.root_dir, size=size, batch_size=batch_size
             )
 
         if iou_threshold is None:
```

(Indentation in the hunks above is reconstructed; the scraped page had stripped it.)
