Polymarket Data Downloader & Processor

This project contains Python scripts to download historical market, event, and price history data from the Polymarket APIs and process it into usable TSV formats.

Scripts

  1. download_markets.py: Fetches market data in batches based on status (e.g., closed) from the Gamma API and saves each batch as a .jsonl file in the specified output directory.

    • Purpose: Downloads the initial set of market overview data.
    • Output Directory: Contains files like markets_offset_0_limit_20.jsonl.
    • Resume: Automatically detects the last successfully saved batch and resumes downloading from the next offset.
    • Example Command:
      python download_markets.py --output-dir market_data --status closed
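    • Sketch: a minimal sketch of the batched, resumable download pattern described above. The endpoint URL and the limit/offset/closed query parameters are assumptions based on Polymarket's public Gamma API docs, not taken from the script itself:

      # Minimal sketch of a batched, resumable market download (assumptions noted above).
      import json
      import os
      import re

      import requests

      GAMMA_URL = "https://gamma-api.polymarket.com/markets"  # assumed endpoint
      LIMIT = 20

      def next_offset(output_dir: str) -> int:
          """Resume: find the highest offset already saved and start after it."""
          offsets = [-LIMIT]
          for name in os.listdir(output_dir):
              m = re.match(r"markets_offset_(\d+)_limit_\d+\.jsonl$", name)
              if m:
                  offsets.append(int(m.group(1)))
          return max(offsets) + LIMIT

      def download(output_dir: str = "market_data") -> None:
          os.makedirs(output_dir, exist_ok=True)
          offset = next_offset(output_dir)
          while True:
              resp = requests.get(
                  GAMMA_URL,
                  params={"limit": LIMIT, "offset": offset, "closed": "true"},
                  timeout=30,
              )
              resp.raise_for_status()
              markets = resp.json()
              if not markets:  # an empty batch means no more data
                  break
              path = os.path.join(output_dir, f"markets_offset_{offset}_limit_{LIMIT}.jsonl")
              with open(path, "w") as f:
                  for market in markets:
                      f.write(json.dumps(market) + "\n")
              offset += LIMIT

      if __name__ == "__main__":
          download()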
  2. download_event_details.py: Scans the market .jsonl files generated by the first script, extracts all unique event IDs mentioned, and then fetches the full details for each event ID using the Gamma API /events/{id} endpoint. Saves each event's details into a separate JSON file.

    • Purpose: Downloads detailed data for every unique event associated with the downloaded markets.
    • Input Directory: The directory containing the market .jsonl files (e.g., market_data).
    • Output Directory: Contains files like event_12345.json.
    • Resume: Checks for existing event files and only downloads details for events not already present.
    • Parallelism: Uses multiple workers (default 8) to speed up downloads.
    • Example Command:
      python download_event_details.py --market-data-dir market_data --output-dir event_details --workers 10
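    • Sketch: a minimal sketch of the scan-then-fetch pattern with a thread pool. The "events" field name on market records and the /events/{id} endpoint are assumptions for illustration:

      # Minimal sketch: collect unique event IDs, then fetch them in parallel.
      import glob
      import json
      import os
      from concurrent.futures import ThreadPoolExecutor

      import requests

      EVENTS_URL = "https://gamma-api.polymarket.com/events/{}"  # assumed endpoint

      def unique_event_ids(market_data_dir: str) -> set:
          ids = set()
          for path in glob.glob(os.path.join(market_data_dir, "*.jsonl")):
              with open(path) as f:
                  for line in f:
                      market = json.loads(line)
                      for event in market.get("events", []):  # field name assumed
                          ids.add(str(event["id"]))
          return ids

      def fetch_event(event_id: str, output_dir: str) -> None:
          out_path = os.path.join(output_dir, f"event_{event_id}.json")
          if os.path.exists(out_path):  # resume: skip events already on disk
              return
          resp = requests.get(EVENTS_URL.format(event_id), timeout=30)
          resp.raise_for_status()
          with open(out_path, "w") as f:
              json.dump(resp.json(), f)

      def main(market_data_dir="market_data", output_dir="event_details", workers=8):
          os.makedirs(output_dir, exist_ok=True)
          with ThreadPoolExecutor(max_workers=workers) as pool:
              futures = [pool.submit(fetch_event, eid, output_dir)
                         for eid in unique_event_ids(market_data_dir)]
              for fut in futures:
                  fut.result()  # surface any download errors

      if __name__ == "__main__":
          main()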
  3. download_price_history.py: Scans individual market detail JSON files (market_{id}.json), extracts the first CLOB token ID (assumed to be the "Yes" outcome), and fetches the price history time series for that token from the CLOB API (/prices-history). Saves the raw JSON response for each market.

    • Purpose: Downloads the raw price history for the "Yes" outcome of each market.
    • Input Directory: Directory containing individual market detail JSON files (e.g., market_details - typically created by Task 1 of process_data.py). Use --market-details-dir.
    • Output Directory: Contains files like price_history_yes_12345.json. Use --output-dir.
    • Resume: Checks for existing price history files and only downloads data for markets not already present.
    • Parallelism: Uses multiple workers (default 8) to speed up downloads.
    • Example Command:
      python download_price_history.py --market-details-dir market_details --output-dir price_history --workers 10
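    • Sketch: a minimal sketch of the token extraction and price fetch. The clobTokenIds field (often a JSON-encoded string), the /prices-history endpoint, and its market/interval parameters are assumptions, not verified against the script:

      # Minimal sketch: first CLOB token -> price history (assumptions noted above).
      import glob
      import json
      import os

      import requests

      CLOB_URL = "https://clob.polymarket.com/prices-history"  # assumed endpoint

      def first_token_id(market: dict):
          """Return the first CLOB token ID, assumed to be the 'Yes' outcome."""
          tokens = market.get("clobTokenIds")  # field name assumed
          if isinstance(tokens, str):          # sometimes a JSON-encoded list
              tokens = json.loads(tokens)
          return tokens[0] if tokens else None

      def download(market_details_dir="market_details", output_dir="price_history"):
          os.makedirs(output_dir, exist_ok=True)
          for path in glob.glob(os.path.join(market_details_dir, "market_*.json")):
              with open(path) as f:
                  market = json.load(f)
              out_path = os.path.join(
                  output_dir, f"price_history_yes_{market['id']}.json")
              if os.path.exists(out_path):  # resume: skip existing files
                  continue
              token_id = first_token_id(market)
              if token_id is None:
                  continue
              resp = requests.get(
                  CLOB_URL,
                  params={"market": token_id, "interval": "max"},  # params assumed
                  timeout=30,
              )
              resp.raise_for_status()
              with open(out_path, "w") as f:
                  json.dump(resp.json(), f)  # save the raw JSON response

      if __name__ == "__main__":
          download()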
  4. process_data.py: Processes the downloaded market, event, and price history data.

    • Task 1 (Optional): Saves each market from the .jsonl files into an individual market_{id}.json file for easier access (--market-output-dir). Required if download_price_history.py needs these files as input.
    • Task 2: Reads the market .jsonl files and the corresponding event detail JSONs. It also checks the downloaded price history files (--price-history-dir). It then creates two separate TSV files:
      • Markets TSV (--market-tsv-output): Contains one row per market, with all market_* prefixed columns. Includes a market_event_ids column (comma-separated string of event IDs) and a market_downloaded_pricehistory_nonempty column (True/False) indicating if the corresponding price history file was found and contained data.
      • Events TSV (--event-tsv-output): Contains one row per unique event encountered across all processed markets, with all event_* prefixed columns.
    • Task 3 (Optional): Reads the price history JSON files (--price-history-dir). For each file with a non-empty history list, it creates a new TSV file (timeseries_{id}.tsv) in the specified output directory (--timeseries-output-dir) containing timestamp and price columns.
    • Inputs: Market data directory (--market-data-dir), event details directory (--event-details-dir), price history directory (--price-history-dir, required for Tasks 2 & 3).
    • Outputs: Individual market JSON directory (for Task 1), Markets TSV file, Events TSV file (for Task 2), Timeseries TSV directory (for Task 3).
    • Example Command (All Tasks):
      python process_data.py \
          --market-data-dir market_data \
          --event-details-dir event_details \
          --price-history-dir price_history \
          --market-output-dir market_details \
          --market-tsv-output polymarket_markets.tsv \
          --event-tsv-output polymarket_events.tsv \
          --timeseries-output-dir timeseries_data
    • Example Command (Only Task 2 & 3 - TSV Creation):
      python process_data.py \
          --market-data-dir market_data \
          --event-details-dir event_details \
          --price-history-dir price_history \
          --market-tsv-output polymarket_markets.tsv \
          --event-tsv-output polymarket_events.tsv \
          --timeseries-output-dir timeseries_data \
          --skip-task1
    • Example Command (Only Task 3 - Timeseries TSVs):
      python process_data.py \
          --market-data-dir market_data `# Still needed for arg parser even if task skipped` \
          --event-details-dir event_details `# Still needed for arg parser even if task skipped` \
          --price-history-dir price_history \
          --timeseries-output-dir timeseries_data \
          --skip-task1 --skip-task2
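    • Sketch (Task 3 only): a minimal sketch of the timeseries TSV conversion, assuming each price history file has the shape {"history": [{"t": <timestamp>, "p": <price>}, ...]}; the key names are an assumption about the CLOB response format:

      # Minimal sketch: price history JSON -> timestamp/price TSV per market.
      import csv
      import glob
      import json
      import os
      import re

      def write_timeseries(price_history_dir="price_history",
                           timeseries_output_dir="timeseries_data"):
          os.makedirs(timeseries_output_dir, exist_ok=True)
          pattern = os.path.join(price_history_dir, "price_history_yes_*.json")
          for path in glob.glob(pattern):
              market_id = re.search(r"price_history_yes_(\w+)\.json$", path).group(1)
              with open(path) as f:
                  history = json.load(f).get("history", [])
              if not history:  # skip empty price histories
                  continue
              out_path = os.path.join(
                  timeseries_output_dir, f"timeseries_{market_id}.tsv")
              with open(out_path, "w", newline="") as f:
                  writer = csv.writer(f, delimiter="\t")
                  writer.writerow(["timestamp", "price"])
                  for point in history:
                      writer.writerow([point["t"], point["p"]])

      if __name__ == "__main__":
          write_timeseries()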
  5. analyze_price_data.py: Analyzes all price history JSON files (e.g., from price_history/). It calculates statistics for each file, such as the number of data points, mean price, standard deviation, time range, and timestamp delta characteristics. It also identifies potential issues like empty files, constant prices, or formatting errors.

    • Purpose: To assess the quality and characteristics of downloaded price history files and produce a structured summary for further processing or filtering.
    • Input Directory: Assumes JSON files are in a price_history/ directory relative to where the script is run.
    • Outputs:
      • analysis_summary.txt: A human-readable text file summarizing the analysis for each file and providing global statistics across all files.
      • analysis_results.json: A JSON file containing a list of detailed analysis dictionaries for each processed file. This file is intended for programmatic use, for example, by filter_price_data.py.
    • Parallelism: Uses multiprocessing to speed up the analysis when handling many files.
    • Example Command:
      python analyze_price_data.py
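    • Sketch: a minimal, serial sketch of the per-file statistics (the actual script parallelizes this with multiprocessing). The {"history": [{"t": ..., "p": ...}]} shape and the issue labels are assumptions:

      # Minimal sketch: compute per-file stats, flag issues, dump a JSON summary.
      import glob
      import json
      import statistics

      def analyze_file(path: str) -> dict:
          result = {"filename": path, "issues": []}
          try:
              with open(path) as f:
                  history = json.load(f).get("history", [])
          except (OSError, json.JSONDecodeError, AttributeError):
              result["issues"].append("format_error")
              return result
          if not history:
              result["issues"].append("empty")
              return result
          prices = [point["p"] for point in history]
          times = [point["t"] for point in history]
          deltas = [b - a for a, b in zip(times, times[1:])]
          result.update(
              num_points=len(prices),
              mean_price=statistics.mean(prices),
              std_price=statistics.pstdev(prices),
              time_range=[min(times), max(times)],
              max_delta=max(deltas) if deltas else None,
          )
          if len(set(prices)) == 1:  # e.g. a series stuck at 0.5
              result["issues"].append("constant_price")
          return result

      def main():
          results = [analyze_file(p) for p in sorted(glob.glob("price_history/*.json"))]
          with open("analysis_results.json", "w") as f:
              json.dump(results, f, indent=2)

      if __name__ == "__main__":
          main()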
  6. filter_price_data.py: Filters the analyzed price history data based on user-defined criteria. It reads the analysis_results.json file generated by analyze_price_data.py.

    • Purpose: To select a subset of price history files that meet specific quality or characteristic thresholds.
    • Input:
      • analysis_results.json: The JSON file output by analyze_price_data.py.
      • Filtering criteria: Defined directly within the filter_criteria dictionary in the script.
    • Outputs:
      • Prints a list of filenames that meet the specified criteria to the console.
      • filtered_filenames.txt: A text file containing the list of filenames that passed the filters, one filename per line.
    • Customization: Users should modify the filter_criteria dictionary within the script to set their desired thresholds for metrics like minimum/maximum number of data points, mean price range, standard deviation range, issues to exclude, or maximum allowed time delta between points.
    • Example Command:
      python filter_price_data.py
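    • Sketch: a minimal sketch of the filtering step. The filter_criteria dictionary is part of the script's design, but the threshold names below are invented for illustration and should be matched to the fields actually present in analysis_results.json:

      # Minimal sketch: keep files whose analysis passes the in-script thresholds.
      import json

      filter_criteria = {  # hypothetical thresholds; edit to taste
          "min_points": 100,
          "max_mean_price": 0.95,
          "min_std_price": 0.01,
          "exclude_issues": {"empty", "constant_price", "format_error"},
      }

      def passes(result: dict) -> bool:
          if set(result.get("issues", [])) & filter_criteria["exclude_issues"]:
              return False
          if result.get("num_points", 0) < filter_criteria["min_points"]:
              return False
          if result.get("mean_price", 1.0) > filter_criteria["max_mean_price"]:
              return False
          return result.get("std_price", 0.0) >= filter_criteria["min_std_price"]

      def main():
          with open("analysis_results.json") as f:
              results = json.load(f)
          kept = [r["filename"] for r in results if passes(r)]
          print("\n".join(kept))
          with open("filtered_filenames.txt", "w") as f:
              f.write("\n".join(kept) + "\n")

      if __name__ == "__main__":
          main()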

Data Flow

  1. Run download_markets.py to fetch market data (e.g., closed markets) from the Gamma API in batches and save each batch as a .jsonl file.
  2. Run download_event_details.py to extract all unique event IDs from the market .jsonl files and fetch each event's full details from the Gamma API /events/{id} endpoint into separate JSON files.
  3. Run download_price_history.py to extract the first CLOB token ID (assumed to be the "Yes" outcome) from each market detail JSON and fetch its price history from the CLOB API /prices-history endpoint.
  4. Run process_data.py to split markets into individual JSON files (Task 1, optional), build the Markets and Events TSVs (Task 2), and write per-market timestamp/price TSVs (Task 3, optional).
  5. Run analyze_price_data.py to compute per-file statistics, flag potential issues, and write analysis_summary.txt and analysis_results.json.
  6. Run filter_price_data.py to select the price history files that pass the user-defined criteria, printing the matching filenames and writing them to filtered_filenames.txt.

Requirements

  • Python 3.x
  • requests library (pip install requests)

Notes

  1. During exploration, roughly 36k markets were found to have non-empty time series, but not all of them are valid: many contain only a constant 0.5 price, so some cleanup is needed.
  2. Many fields, including categories, appear to be empty or otherwise unusable. Some of these could be filled in using LLMs or other techniques.

TODOs

  1. Investigate GraphQL subgraphs for higher-quality data or missing fields: https://thegraph.com/docs/en/subgraphs/guides/polymarket/
  2. Clean up the data and explore it further.
  3. Explore other endpoints.

Relevant links

  1. https://github.com/DominiqueBuob/polymarket_analysis_v1?tab=readme-ov-file
  2. https://goldsky.com/blog/polymarket-dataset
  3. https://github.com/Mr-Slope/Polymarket-Autocorrelation/tree/main
  4. https://docs.polymarket.com/
