A privacy-focused, local document parser designed for the logistics/trucking industry. It uses Ollama (Llama 3.2) to intelligently categorize PDF documents into specific regulatory and operational folders without sending data to the cloud.
- Local Intelligence: Uses `llama3.2` running locally via Ollama. No API costs, no data privacy concerns.
- Smart Classification: Sorts documents into 12 specific categories:
  - IFTA
  - Corporation (Auto-detects California SOI)
  - IRP, Permits, Title Transfer, Driver Files, DOT
  - Clean Truck Check, Invoices, Drug Test
  - INFO (General)
- Manual Verification: Automatically routes ambiguous or low-confidence documents to a `Manual_Verification` folder for human review.
- Extensible: Built with a modular architecture to support future expansions like Excel parsing and Vector DB integration.
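The routing rule described in the features above could be sketched roughly as follows. This is an illustration only, not the project's code: the `route` function, the numeric `confidence` score, and the `0.7` threshold are all assumptions, and the folder names are simply taken from the category list.

```python
# Hypothetical sketch: the real parser may judge ambiguity differently.
CATEGORIES = [
    "IFTA", "Corporation", "IRP", "Permits", "Title Transfer",
    "Driver Files", "DOT", "Clean Truck Check", "Invoices",
    "Drug Test", "INFO",
]
FALLBACK = "Manual_Verification"


def route(label: str, confidence: float, threshold: float = 0.7) -> str:
    """Return the destination folder name for a classified document.

    The label is matched case-insensitively against the known categories.
    Unknown labels and low-confidence results go to the fallback folder.
    """
    normalized = label.strip().lower()
    for category in CATEGORIES:
        if category.lower() == normalized and confidence >= threshold:
            return category
    return FALLBACK
```

Sending unknown labels to the fallback folder as well as low-confidence ones means a model reply that does not match any category never creates a stray folder.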
- Python 3.10+
- Ollama: Must be installed and running.
  - Download Ollama or run `brew install ollama`
  - Pull the model: `ollama pull llama3.2`
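To confirm the prerequisites before running the parser, one option is to ask the local Ollama server which models it has installed. The sketch below assumes Ollama is listening on its default address, `http://localhost:11434`, and uses its `/api/tags` endpoint. The helper names `fetch_tags` and `model_available` are hypothetical and not part of this repository.

```python
import json
import urllib.request


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally installed models from the Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def model_available(tags: dict, name: str) -> bool:
    """Return True if `name` appears in the server's model list.

    Installed names usually carry a tag suffix such as ":latest",
    so the part before the colon is compared as well.
    """
    for entry in tags.get("models", []):
        installed = entry.get("name", "")
        if installed == name or installed.split(":", 1)[0] == name:
            return True
    return False
```

Calling `model_available(fetch_tags(), "llama3.2")` should return `True` once the model has been pulled. If the server is not running, `fetch_tags` raises a connection error instead.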
- Clone the repository: `git clone https://github.com/abhinyaay/document-parser.git`, then `cd document-parser`
- Create and activate a virtual environment: `python3 -m venv venv`, then `source venv/bin/activate`
- Install dependencies: `pip install -r requirements.txt`
- Start Ollama Service: `brew services start ollama` (or simply run `ollama serve` in a separate terminal)
- Add Documents: Place your PDF files in the `input/` directory.
- Run Parser: `python3 parser_v2.py`
- Check Output: Sorted files will appear in the `output/` directory, organized by category.
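The last step, files organized by category under `output/`, might look something like the sketch below. It is an assumption-laden illustration: `place_in_category` is a hypothetical helper, and it copies the file rather than moving it, which may not match what `parser_v2.py` actually does.

```python
import shutil
from pathlib import Path


def place_in_category(pdf_path, category, output_root="output"):
    """Copy a PDF into output_root/<category>/ and return the new path.

    The category folder is created on first use. Copying (not moving)
    is an assumption made here so the original stays in input/.
    """
    src = Path(pdf_path)
    dest_dir = Path(output_root) / category
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copy2(src, dest)
    return dest
```

For example, `place_in_category("input/report.pdf", "IFTA")` would leave a copy at `output/IFTA/report.pdf`.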
- `parser_v2.py`: Main entry point.
- `classifier/local_llm.py`: Logic interfacing with the Ollama API.
- `extractors/`: Modules for reading different file formats (currently PDF).
- `input/`: Drop your raw files here.
- `output/`: Processed files land here.
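As a rough idea of what the logic in `classifier/local_llm.py` could involve, the sketch below sends extracted text to Ollama's `/api/generate` endpoint and reads back the model's reply. The prompt wording, the 4000-character cut-off, and the function names `build_payload` and `classify` are assumptions for illustration. The repository's actual implementation may differ.

```python
import json
import urllib.request

# Default local Ollama endpoint; adjust if the server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(text, categories, model="llama3.2"):
    """Build the JSON body for a single, non-streaming classification request."""
    prompt = (
        "Classify the following document into exactly one of these "
        f"categories: {', '.join(categories)}.\n"
        "Reply with the category name only.\n\n"
        f"Document:\n{text[:4000]}"  # truncate long documents (assumed limit)
    )
    return {"model": model, "prompt": prompt, "stream": False}


def classify(text, categories, url=OLLAMA_URL):
    """Send the text to the local model and return its raw reply, stripped."""
    data = json.dumps(build_payload(text, categories)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body.get("response", "").strip()
```

Setting `"stream": False` asks Ollama for one complete JSON object instead of a stream of partial tokens, which keeps the parsing simple for a one-word answer.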