This project contains Python scripts to download historical market, event, and price history data from the Polymarket APIs and process it into usable TSV formats.
- `download_markets.py`: Fetches market data in batches based on status (e.g., `closed`) from the Gamma API and saves each batch as a `.jsonl` file in the specified output directory.
  - Purpose: Downloads the initial set of market overview data.
  - Output Directory: Contains files like `markets_offset_0_limit_20.jsonl`.
  - Resume: Automatically detects the last successfully saved batch and resumes downloading from the next offset.
  - Example Command:
    ```
    python download_markets.py --output-dir market_data --status closed
    ```
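The resume step can be sketched as follows. This is a minimal illustration, assuming batch files follow the `markets_offset_{offset}_limit_{limit}.jsonl` naming shown above; `next_offset` is a hypothetical helper, not the script's actual function.

```python
import re

def next_offset(filenames, limit=20):
    """Find the offset to resume from, given existing batch filenames.

    Assumes filenames follow the pattern
    markets_offset_{offset}_limit_{limit}.jsonl.
    """
    pattern = re.compile(r"markets_offset_(\d+)_limit_(\d+)\.jsonl$")
    offsets = [int(m.group(1)) for f in filenames if (m := pattern.match(f))]
    if not offsets:
        return 0  # nothing saved yet: start from the beginning
    # Resume one batch past the largest offset already on disk.
    return max(offsets) + limit
```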
- `download_event_details.py`: Scans the market `.jsonl` files generated by the first script, extracts all unique event IDs mentioned, and then fetches the full details for each event ID using the Gamma API `/events/{id}` endpoint. Saves each event's details into a separate JSON file.
  - Purpose: Downloads detailed data for every unique event associated with the downloaded markets.
  - Input Directory: The directory containing the market `.jsonl` files (e.g., `market_data`).
  - Output Directory: Contains files like `event_12345.json`.
  - Resume: Checks for existing event files and only downloads details for events not already present.
  - Parallelism: Uses multiple workers (default 8) to speed up downloads.
  - Example Command:
    ```
    python download_event_details.py --market-data-dir market_data --output-dir event_details --workers 10
    ```
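The ID-extraction step can be sketched like this. It is a minimal sketch assuming each `.jsonl` line is a market object whose `events` field is a list of dicts with an `id` key; the field names and both helpers are assumptions, not the script's actual code.

```python
import json
from pathlib import Path

def event_ids_from_market(market):
    """Yield the event IDs referenced by one market object.

    Assumes the market JSON carries an 'events' list of dicts with an
    'id' key (field names are an assumption).
    """
    for event in market.get("events") or []:
        if "id" in event:
            yield str(event["id"])

def unique_event_ids(market_data_dir):
    """Scan every .jsonl file in market_data_dir and collect unique event IDs."""
    ids = set()
    for path in Path(market_data_dir).glob("*.jsonl"):
        with path.open() as f:
            for line in f:
                if line.strip():
                    ids.update(event_ids_from_market(json.loads(line)))
    return sorted(ids)
```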
- `download_price_history.py`: Scans individual market detail JSON files (`market_{id}.json`), extracts the first CLOB token ID (assumed to be the "Yes" outcome), and fetches the price history time series for that token from the CLOB API (`/prices-history`). Saves the raw JSON response for each market.
  - Purpose: Downloads the raw price history for the "Yes" outcome of each market.
  - Input Directory: Directory containing individual market detail JSON files (e.g., `market_details`, typically created by Task 1 of `process_data.py`). Use `--market-details-dir`.
  - Output Directory: Contains files like `price_history_yes_12345.json`. Use `--output-dir`.
  - Resume: Checks for existing price history files and only downloads data for markets not already present.
  - Parallelism: Uses multiple workers (default 8) to speed up downloads.
  - Example Command:
    ```
    python download_price_history.py --market-details-dir market_details --output-dir price_history --workers 10
    ```
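The token-ID extraction can be sketched as below. It assumes the market detail JSON stores its CLOB token IDs under a `clobTokenIds` field that may be a JSON-encoded string (as Gamma market objects typically are); treat the field name as an assumption and verify it against your downloaded files.

```python
import json

def first_clob_token_id(market_detail):
    """Return the first CLOB token ID from a market detail dict, or None.

    The first entry is assumed to be the "Yes" outcome. 'clobTokenIds'
    may arrive either as a JSON-encoded string or as a plain list.
    """
    raw = market_detail.get("clobTokenIds")
    if not raw:
        return None
    token_ids = json.loads(raw) if isinstance(raw, str) else raw
    return token_ids[0] if token_ids else None
```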
- `process_data.py`: Processes the downloaded market, event, and price history data.
  - Task 1 (Optional): Saves each market from the `.jsonl` files into an individual `market_{id}.json` file for easier access (`--market-output-dir`). Required if `download_price_history.py` needs these files as input.
  - Task 2: Reads the market `.jsonl` files and the corresponding event detail JSONs. It also checks the downloaded price history files (`--price-history-dir`). It then creates two separate TSV files:
    - Markets TSV (`--market-tsv-output`): Contains one row per market, with all `market_*` prefixed columns. Includes a `market_event_ids` column (comma-separated string of event IDs) and a `market_downloaded_pricehistory_nonempty` column (True/False) indicating if the corresponding price history file was found and contained data.
    - Events TSV (`--event-tsv-output`): Contains one row per unique event encountered across all processed markets, with all `event_*` prefixed columns.
  - Task 3 (Optional): Reads the price history JSON files (`--price-history-dir`). For each file with a non-empty `history` list, it creates a new TSV file (`timeseries_{id}.tsv`) in the specified output directory (`--timeseries-output-dir`) containing `timestamp` and `price` columns.
  - Inputs: Market data directory (`--market-data-dir`), event details directory (`--event-details-dir`), price history directory (`--price-history-dir`, required for Tasks 2 & 3).
  - Outputs: Individual market JSON directory (for Task 1), Markets TSV file, Events TSV file (for Task 2), Timeseries TSV directory (for Task 3).
  - Example Command (All Tasks):
    ```
    python process_data.py \
      --market-data-dir market_data \
      --event-details-dir event_details \
      --price-history-dir price_history \
      --market-output-dir market_details \
      --market-tsv-output polymarket_markets.tsv \
      --event-tsv-output polymarket_events.tsv \
      --timeseries-output-dir timeseries_data
    ```
  - Example Command (Only Tasks 2 & 3 - TSV Creation):
    ```
    python process_data.py \
      --market-data-dir market_data \
      --event-details-dir event_details \
      --price-history-dir price_history \
      --market-tsv-output polymarket_markets.tsv \
      --event-tsv-output polymarket_events.tsv \
      --timeseries-output-dir timeseries_data \
      --skip-task1
    ```
  - Example Command (Only Task 3 - Timeseries TSVs):
    ```
    python process_data.py \
      --market-data-dir market_data `# Still needed for arg parser even if task skipped` \
      --event-details-dir event_details `# Still needed for arg parser even if task skipped` \
      --price-history-dir price_history \
      --timeseries-output-dir timeseries_data \
      --skip-task1 --skip-task2
    ```
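Task 3's conversion can be sketched like this, assuming the CLOB `/prices-history` response is a dict with a `history` list of points keyed `t` (unix timestamp) and `p` (price); `history_to_tsv` is a hypothetical helper that mirrors the skip-on-empty behaviour described above.

```python
def history_to_tsv(price_history):
    """Convert a price-history response dict to TSV text.

    Returns None when the history list is empty, mirroring Task 3's
    behaviour of only writing a file for non-empty histories.
    """
    points = price_history.get("history") or []
    if not points:
        return None
    lines = ["timestamp\tprice"]
    for point in points:
        lines.append(f"{point['t']}\t{point['p']}")
    return "\n".join(lines) + "\n"
```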
- `analyze_price_data.py`: Analyzes all price history JSON files (e.g., from `price_history/`). It calculates statistics for each file, such as the number of data points, mean price, standard deviation, time range, and timestamp delta characteristics. It also identifies potential issues like empty files, constant prices, or formatting errors.
  - Purpose: To assess the quality and characteristics of downloaded price history files and produce a structured summary for further processing or filtering.
  - Input Directory: Assumes JSON files are in a `price_history/` directory relative to where the script is run.
  - Outputs:
    - `analysis_summary.txt`: A human-readable text file summarizing the analysis for each file and providing global statistics across all files.
    - `analysis_results.json`: A JSON file containing a list of detailed analysis dictionaries for each processed file. This file is intended for programmatic use, for example, by `filter_price_data.py`.
  - Parallelism: Uses multiprocessing to speed up the analysis when handling many files.
  - Example Command:
    ```
    python analyze_price_data.py
    ```
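The per-file statistics can be sketched as follows; the metric and issue names here are illustrative and need not match the exact keys `analyze_price_data.py` writes to `analysis_results.json`.

```python
import statistics

def analyze_history(points):
    """Compute summary statistics for one price history.

    points: list of {'t': timestamp, 'p': price} dicts. Flags an
    empty history and a constant price as issues, in the spirit of
    the checks described above.
    """
    if not points:
        return {"num_points": 0, "issues": ["empty"]}
    issues = []
    prices = [pt["p"] for pt in points]
    times = [pt["t"] for pt in points]
    stdev = statistics.pstdev(prices)
    if stdev == 0:
        issues.append("constant_price")
    # Gaps between consecutive timestamps characterise sampling density.
    deltas = [b - a for a, b in zip(times, times[1:])]
    return {
        "num_points": len(points),
        "mean_price": statistics.mean(prices),
        "price_stdev": stdev,
        "time_range": (min(times), max(times)),
        "max_delta": max(deltas) if deltas else None,
        "issues": issues,
    }
```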
- `filter_price_data.py`: Filters the analyzed price history data based on user-defined criteria. It reads the `analysis_results.json` file generated by `analyze_price_data.py`.
  - Purpose: To select a subset of price history files that meet specific quality or characteristic thresholds.
  - Inputs:
    - `analysis_results.json`: The JSON file output by `analyze_price_data.py`.
    - Filtering criteria: Defined directly within the `filter_criteria` dictionary in the script.
  - Outputs:
    - Prints a list of filenames that meet the specified criteria to the console.
    - `filtered_filenames.txt`: A text file containing the list of filenames that passed the filters, one filename per line.
  - Customization: Users should modify the `filter_criteria` dictionary within the script to set their desired thresholds for metrics like minimum/maximum number of data points, mean price range, standard deviation range, issues to exclude, or maximum allowed time delta between points.
  - Example Command:
    ```
    python filter_price_data.py
    ```
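Applying such criteria can be sketched like this; the key names (`min_points`, `mean_price_range`, `excluded_issues`) are an illustrative shape, not necessarily the keys of the script's actual `filter_criteria` dictionary.

```python
def passes_filters(result, criteria):
    """Check one analysis-result dict against a criteria dict.

    Illustrative criteria: a minimum number of data points, an
    acceptable mean-price range, and a set of issues that disqualify
    a file outright.
    """
    if result.get("num_points", 0) < criteria.get("min_points", 0):
        return False
    lo, hi = criteria.get("mean_price_range", (0.0, 1.0))
    if not (lo <= result.get("mean_price", 0.0) <= hi):
        return False
    if set(result.get("issues", [])) & set(criteria.get("excluded_issues", [])):
        return False
    return True
```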
Typical workflow:

- Run `download_markets.py` to fetch market data based on status (e.g., `closed`) from the Gamma API and save each batch as a `.jsonl` file in the specified output directory.
- Run `download_event_details.py` to scan the market `.jsonl` files, extract all unique event IDs mentioned, and fetch the full details for each event ID using the Gamma API `/events/{id}` endpoint. Save each event's details into a separate JSON file.
- Run `download_price_history.py` to scan individual market detail JSON files, extract the first CLOB token ID (assumed to be the "Yes" outcome), and fetch the price history time series for that token from the CLOB API (`/prices-history`). Save the raw JSON response for each market.
- Run `process_data.py` to process the downloaded market, event, and price history data. This script performs three tasks:
  - Task 1 (Optional): Saves each market from the `.jsonl` files into an individual `market_{id}.json` file for easier access.
  - Task 2: Reads the market `.jsonl` files and the corresponding event detail JSONs. It also checks the downloaded price history files. It then creates two separate TSV files: a Markets TSV and an Events TSV.
  - Task 3 (Optional): Reads the price history JSON files and creates a new TSV file for each file with a non-empty `history` list, containing `timestamp` and `price` columns.
- Run `analyze_price_data.py` to analyze all price history JSON files. It calculates statistics for each file and identifies potential issues. The script produces two outputs: a human-readable text file summarizing the analysis for each file and a JSON file containing a list of detailed analysis dictionaries for each processed file.
- Run `filter_price_data.py` to filter the analyzed price history data based on user-defined criteria. It reads the `analysis_results.json` file generated by `analyze_price_data.py` and prints a list of filenames that meet the specified criteria to the console. It also saves the list of filenames that passed the filters to a text file.
- Python 3.x
- `requests` library (`pip install requests`)
- During exploration we found that about 36k markets have non-empty time series, but not all of them are valid: many contain only a constant 0.5 price, so some cleanup is needed.
- Many fields, including categories, appear to be empty or unhelpful. We could fill some of these in using LLMs or other techniques.
- Check whether GraphQL provides better-quality data or the missing fields: https://thegraph.com/docs/en/subgraphs/guides/polymarket/
- Data cleanup and more exploration.
- Other endpoints.