Python-based PDF parser utilizing pypdf and Ollama for semantic categorization of regulatory logistics files.

abhinyaay/document-parser

Local LLM Document Parser (Logistics Edition)

A privacy-focused, local document parser designed for the logistics/trucking industry. It uses Ollama (Llama 3.2) to intelligently categorize PDF documents into specific regulatory and operational folders without sending data to the cloud.

Features

  • Local Intelligence: Runs llama3.2 locally via Ollama, so there are no API costs and document contents never leave your machine.
  • Smart Classification: Sorts documents into 12 specific categories:
    • IFTA
    • Corporation (Auto-detects California SOI)
    • IRP, Permits, Title Transfer, Driver Files, DOT
    • Clean Truck Check, Invoices, Drug Test
    • INFO (General)
  • Manual Verification: Automatically routes ambiguous or low-confidence documents to a Manual_Verification folder for human review.
  • Extensible: Built with a modular architecture to support future expansions like Excel parsing and Vector DB integration.
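The routing behind the Manual Verification feature could look roughly like the sketch below. It is not taken from the repository: the function name choose_folder, the 0.7 threshold, and the idea that the classifier returns a numeric confidence are all assumptions made for illustration. The category names are the ones listed above.

```python
from pathlib import Path

# Category names copied from the feature list; everything else here is hypothetical.
CATEGORIES = {
    "IFTA", "Corporation", "IRP", "Permits", "Title Transfer", "Driver Files",
    "DOT", "Clean Truck Check", "Invoices", "Drug Test", "INFO",
}
MANUAL_FOLDER = "Manual_Verification"
CONFIDENCE_THRESHOLD = 0.7  # assumed cutoff, not a value from the project


def choose_folder(category: str, confidence: float,
                  output_root: Path = Path("output")) -> Path:
    """Return the destination folder for one classified document.

    Unknown category names and low-confidence results both fall back to
    the manual review folder instead of being filed automatically.
    """
    if category not in CATEGORIES or confidence < CONFIDENCE_THRESHOLD:
        return output_root / MANUAL_FOLDER
    return output_root / category
```

Treating an unrecognized category the same way as a low score means a model reply that drifts from the expected labels still ends up in front of a human.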

Prerequisites

  • Python 3.10+
  • Ollama: Must be installed and running.
    • Install it from https://ollama.com, or on macOS with Homebrew: brew install ollama
    • Pull the model: ollama pull llama3.2
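To confirm both prerequisites before running the parser, a small check along these lines may help. It is a sketch, not part of the repository: the function name model_is_available is hypothetical, and it assumes Ollama is listening on its default address, http://localhost:11434. It queries Ollama's /api/tags endpoint, which lists locally pulled models.

```python
import json
import urllib.error
import urllib.request


def model_is_available(model: str = "llama3.2",
                       base_url: str = "http://localhost:11434",
                       timeout: float = 2.0) -> bool:
    """Return True if Ollama answers and reports the given model as pulled."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            data = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Server not running, unreachable, or the reply was not valid JSON.
        return False
    if not isinstance(data, dict):
        return False
    names = [m.get("name", "") for m in data.get("models", [])]
    # Ollama reports tagged names such as "llama3.2:latest".
    return any(n == model or n.startswith(model + ":") for n in names)
```

If this returns False, start the service and pull the model again with ollama pull llama3.2.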

Installation

  1. Clone the repository:

    git clone https://github.com/abhinyaay/document-parser.git
    cd document-parser
  2. Create and activate a virtual environment:

    python3 -m venv venv
    source venv/bin/activate
  3. Install dependencies:

    pip install -r requirements.txt

Usage

  1. Start Ollama Service:

    brew services start ollama
    # OR simply run 'ollama serve' in a separate terminal
  2. Add Documents: Place your PDF files in the input/ directory.

  3. Run Parser:

    python3 parser_v2.py
  4. Check Output: Sorted files will appear in the output/ directory, organized by category.
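The steps above amount to a read, classify, and file loop. The sketch below shows one way such a loop could be written. It is not the contents of parser_v2.py: the function names are hypothetical, the classifier is passed in as a plain callable, and files are copied rather than moved because the README does not say which the real script does. Text extraction uses pypdf's PdfReader and extract_text.

```python
import shutil
from pathlib import Path
from typing import Callable


def extract_pdf_text(pdf_path: Path) -> str:
    """Concatenate the text of every page in a PDF using pypdf."""
    from pypdf import PdfReader  # lazy import: only needed for real PDFs

    reader = PdfReader(str(pdf_path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def sort_documents(input_dir: str, output_dir: str,
                   classify: Callable[[str], str],
                   extract: Callable[[Path], str] = extract_pdf_text) -> dict[str, str]:
    """Copy each PDF in input_dir into output_dir/<category>/.

    Returns a mapping of file name to the category it was filed under.
    """
    results: dict[str, str] = {}
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        category = classify(extract(pdf))
        target = Path(output_dir) / category
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(pdf, target / pdf.name)  # assumption: copy, leaving input/ untouched
        results[pdf.name] = category
    return results
```

Passing the classifier and extractor as arguments keeps the loop testable without a running Ollama server or real PDF files.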

Project Structure

  • parser_v2.py: Main entry point.
  • classifier/local_llm.py: Logic for talking to the Ollama API.
  • extractors/: Modules for reading different file formats (currently PDF).
  • input/: Drop your raw files here.
  • output/: Processed files land here.
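For readers curious what the Ollama-facing module might involve, here is a minimal sketch of building a request for Ollama's /api/generate endpoint and parsing its reply. It does not reproduce classifier/local_llm.py: the prompt wording, the function names, the 4000-character truncation, and the category/confidence reply shape are all assumptions. The payload fields model, prompt, stream, and format, and the response field in the reply, are part of Ollama's documented generate API.

```python
import json


def build_request(text: str, categories: list[str],
                  model: str = "llama3.2", max_chars: int = 4000) -> dict:
    """Build a JSON payload to POST to <ollama>/api/generate."""
    prompt = (
        "Classify the document below into exactly one of these categories: "
        + ", ".join(categories)
        + '. Reply as JSON: {"category": "<name>", "confidence": <0..1>}.\n\n'
        + text[:max_chars]  # assumed truncation to keep the prompt small
    )
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def parse_reply(body: str) -> tuple[str, float]:
    """Extract (category, confidence) from the raw /api/generate response text.

    Any malformed reply yields ("", 0.0) so the caller can treat it as
    unclassified rather than crash.
    """
    try:
        answer = json.loads(json.loads(body)["response"])
        return str(answer["category"]), float(answer["confidence"])
    except (KeyError, TypeError, ValueError):
        return "", 0.0
```

Setting stream to False asks Ollama for a single JSON object instead of a stream of chunks, which keeps the parsing to two json.loads calls.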
