If `my_address_file.csv` is a file in the current working directory with an address column named `address`, then the DeGAUSS command:

    docker run --rm -v $PWD:/tmp ghcr.io/degauss-org/geocoder:3.3.0 my_address_file.csv

will produce `my_address_file_geocoder_3.3.0_score_threshold_0.5.csv` with added columns:
- `matched_street`, `matched_city`, `matched_state`, `matched_zip`: matched address components (e.g., `matched_street` is the street the geocoder matched with the input address); can be used to investigate input address misspellings, typos, etc.
- `precision`: The method/precision of the geocode. The value will be one of:
  - `range`: interpolated based on address ranges from street segments
  - `street`: center of the matched street
  - `intersection`: intersection of two streets
  - `zip`: centroid of the matched zip code
  - `city`: centroid of the matched city
- `score`: The percentage of text match between the given address and the geocoded result, expressed as a number between 0 and 1. A higher score indicates a closer match. Note that each score is relative within a precision method (i.e., a `score` of `0.8` with a `precision` of `range` is not the same as a `score` of `0.8` with a `precision` of `street`).
- `lat` and `lon`: geocoded coordinates for matched address
- `geocode_result`: A character string summarizing the geocoding result. The value will be one of:
  - `geocoded`: the address was geocoded with a `precision` of either `range` or `street` and a `score` of `0.5` or greater.
  - `imprecise_geocode`: the address was geocoded, but results were suppressed because the `precision` was `intersection`, `zip`, or `city` and/or the `score` was less than `0.5`.
  - `po_box`: the address was not geocoded because it is a PO Box
  - `cincy_inst_foster_addr`: the address was not geocoded because it is a known institutional address, not a residential address
  - `non_address_text`: the address was not geocoded because it was blank or listed as "foreign", "verify", or "unknown"
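The `geocode_result` column is a convenient first check on an output file. The following is a minimal sketch in Python (standard library only) that tallies the values of `geocode_result` and keeps the rows marked `geocoded`. The helper names `read_rows`, `summarize_results`, and `keep_geocoded` are hypothetical and not part of DeGAUSS, and the sample rows are made-up placeholders rather than real geocoder output.

```python
import csv
import io
from collections import Counter


def read_rows(handle):
    """Read a geocoder output CSV from an open file handle into a list of dicts."""
    return list(csv.DictReader(handle))


def summarize_results(rows):
    """Count how many rows fall into each geocode_result category."""
    return Counter(row["geocode_result"] for row in rows)


def keep_geocoded(rows):
    """Keep only rows whose geocode_result is 'geocoded'."""
    return [row for row in rows if row["geocode_result"] == "geocoded"]


# In-memory stand-in for an output file (placeholder values, for illustration only).
sample = io.StringIO(
    "id,address,precision,score,lat,lon,geocode_result\n"
    "1,example address one,range,0.9,39.1,-84.5,geocoded\n"
    "2,example address two,zip,0.7,,,imprecise_geocode\n"
    "3,example address three,,,,,po_box\n"
)
rows = read_rows(sample)
print(summarize_results(rows))
print(len(keep_geocoded(rows)))
```

To run this against a real output file, pass a handle opened with `open("my_address_file_geocoder_3.3.0_score_threshold_0.5.csv", newline="")` to `read_rows` instead of the in-memory sample.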
- Geocodes with a resulting precision of `intersection`, `zip`, or `city` are returned with a missing `lat` and `lon` because they are likely too inaccurate and/or too imprecise to be used for further analysis.
- By default, `lat` and `lon` are also returned as missing if the `score` is less than `0.5` (regardless of the precision).
- This threshold can be changed by including an optional argument in the docker call (e.g., `docker run --rm -v $PWD:/tmp ghcr.io/degauss-org/geocoder:3.3.0 my_address_file.csv 0.6`).
- Supplying `all` instead of a numeric `score_threshold` returns all geocodes regardless of `score` or `precision`, and bypasses the `po_box`, `cincy_inst_foster_addr`, and `non_address_text` filters.
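The suppression rules above can be restated as a small predicate. This is only an illustrative sketch of the documented behavior, not the container's actual code; the function name `coordinates_reported` and the constant `PRECISE_METHODS` are hypothetical, and the function assumes the address produced a match in the first place.

```python
# Precision methods whose coordinates are kept (per the rules described above).
PRECISE_METHODS = {"range", "street"}


def coordinates_reported(precision, score, score_threshold=0.5):
    """Return True if lat/lon would be kept for a matched address.

    score_threshold is a number between 0 and 1, or the string "all",
    which disables the precision and score filters.
    """
    if score_threshold == "all":
        return True
    if precision not in PRECISE_METHODS:
        return False  # intersection, zip, and city are always suppressed
    return score >= float(score_threshold)


print(coordinates_reported("range", 0.8))        # kept under the default threshold
print(coordinates_reported("zip", 0.9))          # suppressed: imprecise method
print(coordinates_reported("range", 0.55, 0.6))  # suppressed: below custom threshold
print(coordinates_reported("zip", 0.9, "all"))   # kept: filters disabled
```

A predicate like this can help when comparing outputs produced with different thresholds, since it shows which rows are expected to change.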
- Other columns may be present, but it is recommended to only include `address` and an optional identifier column (e.g., `id`). Fewer columns will increase geocoding speed.
- Address data must be in one column called `address`.
- Separate the different address components with a space
- Do not include apartment numbers or "second address line" (but it's okay if you can't remove them)
- ZIP codes must be five digits (e.g., `32709`) and not "plus four" (e.g., `32709-0000`)
- Do not try to geocode addresses without a valid five-digit ZIP code; the geocoder uses the ZIP code to complete its initial searches, and without one it will likely return incorrect matches
- Spelling should be as accurate as possible, but the program performs "fuzzy matching", so an exact match is not necessary
- Capitalization does not affect results
- Abbreviations may be used (e.g., `St.` instead of `Street` or `OH` instead of `Ohio`)
- Use Arabic numerals instead of written numbers (e.g., `13` instead of `thirteen`)
- Address strings with out-of-order items could return NA (e.g., `3333 Burnet Ave Cincinnati 45229 OH`)
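Some of these formatting rules can be applied to an address column before geocoding. The sketch below is a hypothetical pre-processing step, not part of DeGAUSS: `clean_address` trims "plus four" ZIP codes to five digits and collapses repeated whitespace, and `ends_with_five_digit_zip` is a rough format check that cannot confirm a ZIP code is actually valid.

```python
import re

# Matches a ZIP+4 code such as 32709-0000 and captures the five-digit part.
ZIP_PLUS_FOUR = re.compile(r"\b(\d{5})-\d{4}\b")
# Matches a standalone five-digit number at the very end of the string.
TRAILING_ZIP = re.compile(r"\b\d{5}$")


def clean_address(raw):
    """Trim ZIP+4 codes to five digits and separate components with single spaces."""
    text = ZIP_PLUS_FOUR.sub(r"\1", raw)
    return " ".join(text.split())


def ends_with_five_digit_zip(address):
    """Rough format check: does the address end with a five-digit number?"""
    return TRAILING_ZIP.search(address.strip()) is not None


cleaned = clean_address("3333  Burnet Ave Cincinnati OH 45229-0000")
print(cleaned)
print(ends_with_five_digit_zip(cleaned))
print(ends_with_five_digit_zip("3333 Burnet Ave Cincinnati 45229 OH"))
```

Rows that fail the trailing-ZIP check, including out-of-order strings like the last example, are candidates for manual review rather than geocoding.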
- `geocoder.db` is a SQL database prepared following the instructions here using 2021 TIGER/Line Street Range Address files from the Census
- For this container, it is hosted at `s3://geomarker/geocoder_2021.db`
For detailed documentation on DeGAUSS, including general usage and installation, please see the DeGAUSS homepage.
