Deep Field is a web scraper that pulls play-by-play information from baseball-reference.com. It scrapes the details of each play from every available game page and writes them to an SQLite database. It also scrapes some barebones information on players, games, teams, and venues. The full schema is described below.
Note that this database is intended for aggregating your own stats. The scraper does not pull any aggregated stats or metrics itself.
- Ensure the latest version of Python is installed.
- Clone the repo.
- From the root, run
python -m pip install -r requirements.txt
The web scraper can be invoked by running
python -m deepfield.scraper start-year [end-year] [-db database-name]
from the root. The scraper builds an SQLite database of play-by-play information for every game in the given year range. You can issue a keyboard interrupt via Ctrl+C to end the scrape.
- start-year is the earliest year to scrape.
- end-year is the latest year to scrape (inclusive). Defaults to the current year.
- database-name is the name of the database to generate. Defaults to stats.
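To make the defaults above concrete, here is a minimal sketch of an argument parser that mirrors the documented command line. It is not Deep Field's actual parser: the function name `build_parser` and the attribute names `start_year`, `end_year`, and `database_name` are assumptions chosen for illustration.

```python
import argparse
from datetime import date

def build_parser():
    """Build a parser mirroring: start-year [end-year] [-db database-name]."""
    parser = argparse.ArgumentParser(prog="deepfield.scraper")
    parser.add_argument("start_year", type=int, help="earliest year to scrape")
    # Optional positional argument: falls back to the current year when omitted.
    parser.add_argument("end_year", type=int, nargs="?", default=date.today().year,
                        help="latest year to scrape (inclusive)")
    # Hypothetical destination name; the documented default database name is "stats".
    parser.add_argument("-db", dest="database_name", default="stats",
                        help="name of the database to generate")
    return parser

# Only the start year is given, so both defaults apply.
defaults = build_parser().parse_args(["2015"])
# Every argument is given explicitly.
explicit = build_parser().parse_args(["2015", "2018", "-db", "mystats"])
```

In this sketch, `defaults.end_year` is the current year and `defaults.database_name` is `stats`, while `explicit` covers the years 2015 through 2018 and targets a database named `mystats`.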
Note: this scraper can take a long time to run! baseball-reference.com/robots.txt specifies a crawl delay of 3 seconds, and there are thousands of pages that need to be scraped for a given season, so depending on how many years you decide to scrape, a run can take anywhere from a few hours to a few days.
However, the scraper caches the pages it downloads, so if you delete the database or specify a different database name, subsequent scrapes will use the cached pages instead of requesting them over the web. This can make subsequent scrapes faster by an order of magnitude.
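As a rough back-of-the-envelope check on that estimate, the sketch below multiplies a page count by the 3-second crawl delay. The figure of 2,430 pages assumes one page per game in a modern 30-team, 162-game regular season and ignores player, schedule, and postseason pages, so treat the result as a lower bound. The function and constant names are illustrative and not part of Deep Field.

```python
CRAWL_DELAY_SECONDS = 3  # crawl delay from baseball-reference.com/robots.txt, as noted above

def estimate_scrape_hours(num_pages, delay_seconds=CRAWL_DELAY_SECONDS):
    """Return the minimum number of hours needed to request num_pages uncached pages."""
    return num_pages * delay_seconds / 3600

# Assumption: 30 teams * 162 games / 2 teams per game = 2430 game pages per season.
GAMES_PER_SEASON = 30 * 162 // 2

if __name__ == "__main__":
    for seasons in (1, 10, 30):
        hours = estimate_scrape_hours(seasons * GAMES_PER_SEASON)
        print(f"{seasons:>2} season(s): at least {hours:.1f} hours")
```

Under these assumptions a single season needs at least about two hours of waiting on the crawl delay alone, and thirty seasons need roughly two and a half days.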
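The cache-first behaviour described above can be sketched as follows. This is not Deep Field's actual implementation: the cache directory, the file-naming scheme, and the `fetch` callable are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path

def get_page(url, cache_dir, fetch):
    """Return the HTML for url, calling fetch(url) only on a cache miss."""
    # Hypothetical naming scheme: one file per URL, keyed by a hash of the URL.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_file = Path(cache_dir) / (digest + ".html")
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    html = fetch(url)  # the slow, rate-limited step
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(html, encoding="utf-8")
    return html

# Demonstration with a stand-in fetcher, so no network access is needed.
requests_made = []

def fake_fetch(url):
    requests_made.append(url)
    return "<html>stub page for " + url + "</html>"

with tempfile.TemporaryDirectory() as tmp:
    first = get_page("https://example.com/game/1", tmp, fake_fetch)
    second = get_page("https://example.com/game/1", tmp, fake_fetch)
```

The second call is served from disk, so `requests_made` contains a single entry. Because the cache in this sketch is keyed by URL and not by database, deleting or renaming the database would not invalidate it.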
The database contains the following tables. Each table section contains a description of each column.
- Table of plays

  | Field | Description |
  | --- | --- |
  | id | An auto-incremented unique ID. |
  | game_id | The ID of the game the play belongs to. |
  | inning_half | A number corresponding to the half of the inning the play occurred in. This starts at 0 for the top of the 1st and goes to 17 for the bottom of the 9th, continuing for extra innings as needed. |
  | start_outs | Number of outs at the start of the play. |
  | start_on_base | A number in the range [0, 7] corresponding to which bases were occupied at the start of the play as a bit flag; i.e. +1 for first occupied, +2 for second occupied, and +4 for third occupied. |
  | play_num | A 0-based index corresponding to the position in which this play occurred, relative to the game's other plays. |
  | desc | The listed description for the result of the play. |
  | pitch_ct | The listed information for the starting pitch count of the play. For modern games this contains the specifics of the pitches thrown as well. For earlier games, this may be minimal or omitted entirely. |
  | batter_id | ID of the batter who participated in the play. |
  | pitcher_id | ID of the pitcher who participated in the play. |

- Table of players

  | Field | Description |
  | --- | --- |
  | id | An auto-incremented unique ID. |
  | name | Name of the player. |
  | name_id | The name ID used by baseball-reference. This is typically the first five letters of the last name, the first two letters of the first name, and two unique digits. |
  | bats | 0 if this player bats left-handed, 1 if right-handed, and 2 if ambidextrous. |
  | throws | 0 if this player throws left-handed, 1 if right-handed, and 2 if ambidextrous. |

- Table of games

  | Field | Description |
  | --- | --- |
  | id | An auto-incremented unique ID. |
  | name_id | The name ID used by baseball-reference. This is typically the home team's three-letter abbreviation, followed by the game date's year, month, and day, followed by a unique digit. |
  | local_start_time | The local start time of the game in the venue's timezone. |
  | time_of_day | 0 if the game was played during the day, 1 if at night. |
  | field_type | 0 if the game was played on turf, 1 if on grass. |
  | date | The date the game was played on. |
  | venue_id | ID of the venue the game was played in. |
  | home_team_id | ID of the home team. |
  | away_team_id | ID of the away team. |

- Table of teams

  | Field | Description |
  | --- | --- |
  | id | An auto-incremented unique ID. |
  | name | Name of the team. |
  | abbreviation | Abbreviation for the team. |

- Table of venues

  | Field | Description |
  | --- | --- |
  | id | An auto-incremented unique ID. |
  | name | Name of the venue. |
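The encoded columns described above (inning_half, start_on_base, and the game name_id) can be decoded with a few small helpers. The sketch below is illustrative only: the helper names, the table name `plays`, the sample rows, and the sample name ID `ABC201904010` are all hypothetical, and the name ID parser assumes the typical layout and will not handle exceptions to it. Only the column names and encodings come from the schema.

```python
import sqlite3
from datetime import datetime

def decode_inning_half(inning_half):
    """Map inning_half (0 = top of the 1st) to (inning number, 'top' or 'bottom')."""
    inning = inning_half // 2 + 1
    half = "top" if inning_half % 2 == 0 else "bottom"
    return inning, half

def decode_bases(start_on_base):
    """Map the start_on_base bit flag to the list of occupied bases."""
    flags = ((1, "first"), (2, "second"), (4, "third"))
    return [name for bit, name in flags if start_on_base & bit]

def parse_game_name_id(name_id):
    """Split a game name_id in the typical layout: 3-letter team, YYYYMMDD, 1 digit."""
    team, date_part, suffix = name_id[:3], name_id[3:11], name_id[11:]
    return team, datetime.strptime(date_part, "%Y%m%d").date(), suffix

# Hypothetical table name and sample rows; only the column names come from the schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plays (id INTEGER PRIMARY KEY, inning_half INTEGER, "
    "start_outs INTEGER, start_on_base INTEGER)"
)
conn.executemany(
    "INSERT INTO plays (inning_half, start_outs, start_on_base) VALUES (?, ?, ?)",
    [(0, 0, 0), (0, 1, 1), (17, 2, 7), (17, 2, 7), (18, 0, 5)],
)

# Count plays that started with the bases loaded (1 + 2 + 4 = 7), grouped by inning half.
rows = conn.execute(
    "SELECT inning_half, COUNT(*) FROM plays "
    "WHERE start_on_base = 7 GROUP BY inning_half"
).fetchall()
loaded_counts = {decode_inning_half(half): count for half, count in rows}
conn.close()
```

With the sample rows above, `loaded_counts` records two bases-loaded plays in the bottom of the 9th. The same pattern of filtering and grouping in SQL, then decoding in Python, is one way to build your own aggregated stats on top of the raw play data.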