We're tired of reading. The meatbag wetware is slow. We will make the words speak. This project is the first step: carving up the monolithic PDF tomes into digestible chapters, ready for the next phase of alchemical transformation into audio.
To handle the beautiful chaos of PDFs from different eras, the script uses a guided, interactive process. It combines automated style analysis with human-in-the-loop confirmation, so a wrong guess can be caught before any file is written.
The workflow is as follows:
- (⇌) Forensic Analysis: When you run the script on a PDF, it first performs a deep analysis of the document's typography. It groups all potential headings by their font size and style (e.g., "Size 12, ALL CAPS") and prints this analysis for you to see.
- (⊕) The Proposal: The script then makes an educated guess at which of these groups represents the true chapters of the document. It uses this guess to generate and display a clear, human-readable Split Plan, showing you exactly which files it intends to create from which page ranges.
- (🔥) User Confirmation: Finally, the script pauses and asks for your explicit confirmation with a `[y/N]` prompt. The Great Cleaving only proceeds if you give the final command.
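For the curious, here is a minimal sketch of how the first two steps could work. It is not the script's actual code: the `Span` shape, the function names, and the sample data are all hypothetical, and it assumes the candidate headings have already been pulled out of the PDF with their page number and font size.

```python
from __future__ import annotations

from collections import defaultdict
from typing import NamedTuple


class Span(NamedTuple):
    """One candidate heading, as if already extracted from the PDF (hypothetical shape)."""
    page: int    # 1-based page number where the text appears
    text: str
    size: float  # font size in points


def style_key(span: Span) -> str:
    """Label a span by size and casing, e.g. 'Size 12, ALL CAPS'."""
    casing = "ALL CAPS" if span.text.isupper() else "Mixed Case"
    return f"Size {span.size:g}, {casing}"


def group_by_style(spans: list[Span]) -> dict[str, list[Span]]:
    """Bucket candidate headings by their style label."""
    groups: dict[str, list[Span]] = defaultdict(list)
    for span in spans:
        groups[style_key(span)].append(span)
    return dict(groups)


def build_split_plan(headings: list[Span], total_pages: int) -> list[tuple[str, int, int]]:
    """Turn chapter start pages into (title, first_page, last_page) ranges.

    Assumes each chapter starts on its own page; a chapter runs until the
    page before the next heading, and the last one runs to the end.
    """
    ordered = sorted(headings, key=lambda s: s.page)
    plan = []
    for i, heading in enumerate(ordered):
        last = ordered[i + 1].page - 1 if i + 1 < len(ordered) else total_pages
        plan.append((heading.text, heading.page, last))
    return plan


# Made-up sample data standing in for a real typography scan.
spans = [
    Span(1, "CHAPTER ONE", 12),
    Span(3, "A minor subsection", 10),
    Span(9, "CHAPTER TWO", 12),
    Span(20, "CHAPTER THREE", 12),
]
groups = group_by_style(spans)
plan = build_split_plan(groups["Size 12, ALL CAPS"], total_pages=30)
for title, first, last in plan:
    print(f"{title}: pages {first}-{last}")
```

Here the chapter group is picked by hand. How the real script chooses among the groups is not described above, so that part is left out of the sketch.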
This script craves the sanctuary of a virtual environment. The use of uv is recommended.
- Create the virtual environment:
    uv venv
- Activate the environment:
    source .venv/bin/activate
- Install dependencies from the requirements file:
    uv pip install -r requirements.txt
The script is an interactive tool that will guide you through the process.
- Make the script executable (first time only):
    chmod +x pdf_chapter_harvester.py
- Run the script on your target file:
    ./pdf_chapter_harvester.py /path/to/your/document.pdf
The script will show you its analysis and a proposed split plan. Review the plan, and if you are satisfied, type `y` and press Enter to begin the split.
If you trust the script's suggestion, you can add the `-y` or `--yes` flag to skip the interactive confirmation and proceed with the split automatically.
    ./pdf_chapter_harvester.py /path/to/your/document.pdf -y
- Specify an output directory (optional):
  By default, the cleaved chapter files are saved in a directory named `chapters`. You can specify a different location with the `-o` flag.
    ./pdf_chapter_harvester.py /path/to/your/document.pdf -o /path/to/your/output_folder
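
The flags described above map naturally onto `argparse`. The following is a rough sketch of that wiring, not the script's real source: `parse_args`, `confirmed`, the `output_dir` attribute, and the prompt wording are assumptions. Only `-y`/`--yes`, `-o`, the `chapters` default, and the `[y/N]` prompt come from this README.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line (hypothetical wiring of the documented flags)."""
    parser = argparse.ArgumentParser(description="Split a PDF into chapter files.")
    parser.add_argument("pdf", help="path to the source PDF")
    parser.add_argument("-y", "--yes", action="store_true",
                        help="skip the interactive confirmation")
    parser.add_argument("-o", dest="output_dir", default="chapters",
                        help="directory for the chapter files")
    return parser.parse_args(argv)


def confirmed(args, ask=input) -> bool:
    """Return True only on an explicit yes; an empty answer counts as no ([y/N])."""
    if args.yes:
        return True
    answer = ask("Proceed with the split? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}


# Demo with an explicit argument list, so nothing blocks waiting for input.
args = parse_args(["document.pdf", "-o", "output_folder", "-y"])
print(args.output_dir, confirmed(args))
```

Passing `ask` as a parameter keeps the prompt testable: a test can hand in a stub instead of typing at a terminal.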