ercas/geotag.py


geotag.py

Geocode flat CSV data using various shapefiles

About

geotag.py implements the Geotagger class, which geotags large datasets quickly by using R-tree spatial indices (via rtree) and gives live progress feedback through tqdm (a progress bar library).
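As a rough illustration of why an R-tree index helps, the sketch below filters candidate polygons by bounding box before running the exact point-in-polygon test. It is not the Geotagger implementation: the `zones` dictionary, the `lookup` function, and the toy square geometries are assumptions made for this example.

```python
from rtree import index
from shapely.geometry import Point, Polygon

# Hypothetical zones: two adjacent squares standing in for shapefile polygons.
zones = {
    "A": Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
    "B": Polygon([(2, 0), (4, 0), (4, 2), (2, 2)]),
}
zone_ids = list(zones)

# Index each polygon's bounding box under its integer position.
idx = index.Index()
for i, zone_id in enumerate(zone_ids):
    idx.insert(i, zones[zone_id].bounds)


def lookup(x, y):
    """Return the ID of the zone containing (x, y), or None if there is none."""
    point = Point(x, y)
    # Cheap step: only polygons whose bounding box touches the point.
    for i in idx.intersection((x, y, x, y)):
        zone_id = zone_ids[i]
        # Exact step: the real geometry test, run on a few candidates only.
        if zones[zone_id].contains(point):
            return zone_id
    return None
```

With many polygons, the index keeps the number of exact geometry tests per point small, which is where most of the speedup comes from.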

This is useful for geotagging large datasets before analysis begins: the resulting crosswalks can be joined later, which avoids a costly spatial join in the middle of the analysis.
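A minimal sketch of the crosswalk workflow, assuming the geotagged output carries a row identifier column. The column names `rowname` and `value`, and the sample values, are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical crosswalk produced ahead of time: row identifier -> ZCTA.
crosswalk = pd.DataFrame(
    {"rowname": [0, 1, 2], "zcta_10": ["02139", "02115", "02139"]}
)

# Hypothetical analysis table sharing the same row identifiers.
records = pd.DataFrame({"rowname": [0, 1, 2], "value": [10.0, 20.0, 5.0]})

# An ordinary key join replaces the spatial join at analysis time.
joined = records.merge(crosswalk, on="rowname", how="left")
totals = joined.groupby("zcta_10")["value"].sum()
```

The expensive geometry work happens once, when the crosswalk is built; every later analysis only pays for a plain tabular merge.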

Other useful features include:

  • Automatically validates input geometries and repairs invalid ones via Shapely's buffer(0) routine.
    • Note that repairing geometries manually is still preferred where possible, as buffer(0) may produce unexpected results.
  • Automatically drops null coordinates.
  • Supports globbed inputs, e.g. tl_2020*bg.shp$GEOID>geoid_bg: all matched paths are read separately and concatenated into a single GeoDataFrame before proceeding.
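The buffer(0) repair mentioned in the list above can be sketched as follows. The `repair` helper and the sample polygon are assumptions for this example and are not taken from geotag.py.

```python
from shapely.geometry import Polygon


def repair(geometry):
    """Return the geometry unchanged if valid, else a buffer(0) repaired copy."""
    if geometry.is_valid:
        return geometry
    # buffer(0) rebuilds the geometry; the result may differ in type or shape.
    return geometry.buffer(0)


# A ring that touches itself at (1, 1), which makes the polygon invalid.
pinched = Polygon([(0, 0), (0, 2), (1, 1), (2, 2), (2, 0), (1, 1), (0, 0)])
fixed = repair(pinched)

print(pinched.is_valid, fixed.is_valid)
```

Because the repaired geometry can change type or lose area, it is worth inspecting the result, which is why manual repair remains the safer option.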

Usage

geotag.py can be used either embedded in a script or as a command-line application.

For command-line usage, see geotag.py --help. The example below uses the example data found in this repository:

$ python3 geotag.py \
>     --input example.csv \
>     --longitude lon \
>     --latitude lat \
>     --output example-geotagged.csv \
>     --rownames-only \
>     --verbose \
>     'zip://tl_2010_25_zcta510.zip$GEOID10>zcta_10'
Reading input file: example.csv
Generating rownames and dropping other columns

(1/1) Operation: zip://tl_2010_25_zcta510.zip$GEOID10>zcta_10
Reading: zip://tl_2010_25_zcta510.zip
Creating rtree index: 100%|███████████| 538/538 [00:00<00:00, 9753.82 indexed/s]
Geotagging: creating column "zcta_10" <- zip://tl_2010_25_zcta510.zip$GEOID10
100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 640.30it/s]

Dropping ['lon', 'lat'] (as per -r/--rownames-only)
Writing to: example-geotagged.csv

Embedding geotag.py is also straightforward:

>>> import pandas, geotag, geopandas
>>> df = pandas.read_csv("example.csv")
>>> gdf = geopandas.read_file("zip://tl_2010_25_zcta510.zip")
>>> tagger = geotag.Geotagger(gdf, "GEOID10")
>>> df["zcta_10"] = df[["lon", "lat"]].apply(
...     lambda xy: tagger.lookup(*xy), axis = 1
... )
>>> df.to_csv("example-geotagged.csv", index=False)

Limitations / future directions

Specifying the coordinate reference system (CRS) of the input CSV file is currently not supported; inputs are assumed to be in WGS 84 (EPSG:4326). Support for alternative CRSs would be valuable, e.g. for survey points that use a different CRS.

CRS transformations are not currently supported. Manual and automatic transformations may be added in the future, though this may be beyond the scope of this project.
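Until such support exists, one possible workaround is to convert coordinates to WGS 84 before geotagging. The sketch below uses pyproj, which geotag.py itself is not known to depend on; the source CRS (EPSG:3857) and the sample point are assumptions for illustration.

```python
from pyproj import Transformer

# Assumed source CRS: Web Mercator (EPSG:3857). Target: WGS 84 (EPSG:4326).
# always_xy=True keeps the (x, y) = (lon, lat) axis order on both sides.
to_wgs84 = Transformer.from_crs("EPSG:3857", "EPSG:4326", always_xy=True)

# Hypothetical projected point, in metres, roughly in the Boston area.
x, y = -7910240.0, 5215074.0
lon, lat = to_wgs84.transform(x, y)

print(lon, lat)
```

The converted lon and lat values can then be written to the CSV columns passed to geotag.py.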

The routines in the Geotagger object could theoretically be parallelized through a MapReduce-like programming pattern for faster geotagging, but memory usage would increase proportionally with the number of cores being used. For this reason, parallelization has not yet been implemented.
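A rough sketch of what such a MapReduce-like pattern might look like, using only the standard library. Everything here is hypothetical: `make_tagger` is a toy stand-in for building a Geotagger, and `split`, `tag_chunk`, and `tag_parallel` are names invented for this example. Each worker builds its own tagger, which is the source of the memory cost described above.

```python
from concurrent.futures import ProcessPoolExecutor


def make_tagger():
    """Toy stand-in for building a Geotagger: tags points by hemisphere."""
    def lookup(lon, lat):
        return "west" if lon < 0 else "east"
    return lookup


def tag_chunk(chunk):
    """Map step: tag one chunk of (lon, lat) pairs."""
    lookup = make_tagger()  # one tagger per chunk, so memory grows with workers
    return [lookup(lon, lat) for lon, lat in chunk]


def split(points, n):
    """Split points into at most n consecutive chunks of near-equal size."""
    size = max(1, -(-len(points) // n))  # ceiling division
    return [points[i:i + size] for i in range(0, len(points), size)]


def tag_parallel(points, workers=2, executor=None):
    """Tag chunks in parallel, then concatenate results in input order."""
    chunks = split(points, workers)
    owns_pool = executor is None
    pool = executor or ProcessPoolExecutor(max_workers=workers)
    try:
        # Reduce step: flatten the per-chunk results back into one list.
        return [tag for tags in pool.map(tag_chunk, chunks) for tag in tags]
    finally:
        if owns_pool:
            pool.shutdown()
```

When run as a script with the default process pool, the call to `tag_parallel` should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.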

Currently, only CSV input is supported. Support for other geospatial data files, e.g. shapefiles or GeoJSONs, as inputs may be added in the future, though this could complicate the Geotagger class.

In the future, the current libraries may also be replaced with more performant alternatives, or CPU-intensive parts reimplemented in a faster language like C++.

Contributions welcome!
