CagVault

A Cache-Augmented Generation (CAG) application for private, local document chat with large language models, featuring intelligent document parsing, LanceDB-backed persistent storage, and credit agreement analysis.

πŸš€ Quick Start (5 Minutes)

# 1. Install prerequisites
brew install ollama
brew services start ollama

# 2. Clone and setup
git clone https://github.com/letslego/cagvault.git
cd cagvault
python3.12 -m venv .venv312
source .venv312/bin/activate
pip install -e .

# 3. Download LLM model (choose one based on your RAM)
# Default (16GB RAM): Qwen3-14B
ollama pull hf.co/unsloth/Qwen3-14B-GGUF:Q4_K_XL
# Or for best quality (64GB+ RAM): DeepSeek V3
# ollama pull deepseek-ai/DeepSeek-V3
# Or lightweight (8GB RAM): Llama 3.1 8B
# ollama pull llama3.1:8b

# 4. Start the app
streamlit run app.py
# Open http://localhost:8501 in your browser

# 5. Upload a PDF and start chatting!

First-Time Tips:

  • Upload a credit agreement PDF to see section analysis in action
  • Try "πŸ’‘ Suggested Questions" after parsing completes
  • Explore the "Sections" tab to see hierarchical structure
  • Use "Agentic Search" for intelligent query understanding

🎯 What's New (December 2025)

πŸ€– Agentic RAG System:

  • 🧠 Multi-Step Reasoning: Agent understands intent, selects strategy, validates answers
  • 🎯 5 Retrieval Strategies: Semantic, Keyword, Hybrid, Agentic, Entity-based (auto-selected)
  • βœ… Self-Reflection: Optional answer validation with confidence scoring
  • πŸ“Š Full Transparency: Complete reasoning traces showing agent's thought process
  • πŸŽ“ Smart Strategy Selection: Automatically chooses best approach based on query type
  • πŸ”§ Claude Agent SDK Integration: 6 specialized MCP tools built with the Agent SDK (see the sketch after this list):
    • 🌐 Web Search: Fetch current data from external sources (@tool decorator)
    • 🏷️ Entity Extraction: Extract dates, amounts, names, organizations (NER-based)
    • πŸ“Š Section Ranking: Prioritize important sections using credit analyst criteria
    • πŸ”— Cross-Document Relationships: Find references, amendments, guarantees
    • πŸ” Fact Verification: Validate claims against web sources
    • πŸ’‘ Follow-Up Suggestions: Intelligent next-question recommendations
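
For a sense of what these tools look like in code, here is a minimal sketch of registering one custom tool with the Claude Agent SDK's @tool decorator and exposing it through an in-process MCP server. The tool name, input schema, and regex logic are illustrative placeholders rather than CagVault's actual implementations, and the decorator/server signatures should be checked against the SDK version you have installed.

import re

from claude_agent_sdk import create_sdk_mcp_server, tool

# Hypothetical entity-extraction tool; the real CagVault tool is NER-based.
@tool("entity_extractor", "Extract dates and dollar amounts from text", {"text": str})
async def entity_extractor(args):
    text = args["text"]
    amounts = re.findall(r"\$[\d,]+(?:\.\d+)?", text)
    dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)
    return {"content": [{"type": "text", "text": f"amounts={amounts} dates={dates}"}]}

# Host the tool on an in-process MCP server that the agent can call.
cag_tools = create_sdk_mcp_server(name="cag-tools", version="1.0.0", tools=[entity_extractor])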

Storage Architecture Upgrade:

  • πŸ—„οΈ LanceDB Embedded Database: Replaced Redis with LanceDB for all persistent storage (sketch after this list)
  • ⚑ In-Process Caching: 3-second TTL DataFrame cache for sub-millisecond reads
  • πŸ” Full-Text Search: Built-in FTS indexes on content, titles, and questions
  • πŸ“¦ Zero External Dependencies: No separate database server required - all data in ./lancedb
  • πŸ”„ Redis Migration Tool: One-time utility to import existing Redis data
  • πŸ”’ ACID Compliance: Reliable transactions with automatic cache invalidation
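
As a rough illustration of this storage model, the sketch below creates an embedded LanceDB table, builds a full-text-search index, and runs an FTS query entirely in-process. The table name, columns, and sample row are made up for the example, and the exact index/search keyword arguments can differ slightly between lancedb releases.

import lancedb

db = lancedb.connect("./lancedb")  # embedded: just a directory on disk, no server

# Illustrative table; the real doc_sections schema has many more columns.
table = db.create_table(
    "demo_sections",
    data=[{"doc_id": "credit_agreement", "title": "Negative Covenants",
           "content": "The Borrower shall not incur additional Indebtedness..."}],
    mode="overwrite",
)

table.create_fts_index("content", replace=True)  # built-in full-text search
hits = table.search("Indebtedness", query_type="fts").limit(5).to_pandas()
print(hits[["title", "content"]])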

Enhanced PDF Intelligence:

  • πŸ”¬ LLM-Powered Section Analysis: Parallel processing with credit analyst classification and importance scoring
  • πŸ“Š Smart Section Extraction: Hierarchical document structure with page-accurate tracking
  • πŸ” Multi-Modal Search: Keyword, semantic, and agentic (Claude-powered) search within documents
  • 🏷️ Named Entity Recognition: Extract and index parties, dates, amounts, and legal terms
  • πŸ“Œ Referenced Section Display: Automatically expand cited sections in chat responses

Intelligent Caching System:

  • πŸ’Ύ Q&A Cache: LanceDB-backed answer caching per document with persistent storage
  • πŸ“š Question Library: Track popular questions by category with autocomplete suggestions
  • ⚑ KV-Cache Optimization: 10-40x faster multi-turn conversations
  • πŸ“ˆ Cache Analytics: Real-time statistics and per-document cache management

Credit Agreement Features:

  • πŸ“‹ Document Classification: Automatic detection of covenants, defaults, and key provisions
  • 🎯 Section Importance Scoring: AI-driven relevance analysis for credit analysts
  • πŸ”— Cross-Reference Detection: Track dependencies between sections
  • πŸ“„ Page-Accurate Citations: Precise page ranges for every section

What is Cache-Augmented Generation (CAG)?

Based on the paper "Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks" (WWW '25), CAG is an alternative paradigm to traditional Retrieval-Augmented Generation (RAG) that leverages the extended context capabilities of modern LLMs.

CAG vs RAG: Visual Comparison

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     TRADITIONAL RAG WORKFLOW                                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                              β”‚
β”‚  User Query                                                                  β”‚
β”‚      β”‚                                                                       β”‚
β”‚      β–Ό                                                                       β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                         β”‚
β”‚  β”‚  Retriever       │◄─────────  Search Index    β”‚  ⏱️  LATENCY             β”‚
β”‚  β”‚  (BM25/Dense)    β”‚         β”‚  (Large DB)      β”‚                         β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                         β”‚
β”‚           β”‚                                                                  β”‚
β”‚           β”‚ Retrieved Documents                                             β”‚
β”‚           β–Ό                                                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                                       β”‚
β”‚  β”‚ Generator (LLM)  β”‚  ⚠️  Risk of:                                         β”‚
β”‚  β”‚                  β”‚      β€’ Missing relevant docs                          β”‚
β”‚  β”‚  (Generate Ans)  β”‚      β€’ Ranking errors                                β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β€’ Search failures                               β”‚
β”‚           β”‚                                                                  β”‚
β”‚           β–Ό                                                                  β”‚
β”‚      Answer                                                                  β”‚
β”‚                                                                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               CACHE-AUGMENTED GENERATION (CAG) WORKFLOW                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                              β”‚
β”‚  β”Œβ”€β”€β”€ SETUP PHASE (One-time) ─────────────────────────────────────────┐   β”‚
β”‚  β”‚                                                                    β”‚   β”‚
β”‚  β”‚  All Documents                                                     β”‚   β”‚
β”‚  β”‚      β”‚                                                             β”‚   β”‚
β”‚  β”‚      β–Ό                                                             β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                             β”‚   β”‚
β”‚  β”‚  β”‚  LLM Processor   β”‚  Populate LanceDB Cache                     β”‚   β”‚
β”‚  β”‚  β”‚  (Batch Process) β”‚  (Sections + Q&A store)                     β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                             β”‚   β”‚
β”‚  β”‚           β”‚                                                        β”‚   β”‚
β”‚  β”‚           β–Ό                                                        β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                         β”‚   β”‚
β”‚  β”‚  β”‚  Cached LanceDB Storeβ”‚  πŸ’Ύ  Embedded on Disk                  β”‚   β”‚
β”‚  β”‚  β”‚  (Ready to use)      β”‚                                         β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                         β”‚   β”‚
β”‚  β”‚             β”‚                                                     β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                β”‚                                                           β”‚
β”‚  β”Œβ”€β”€β”€ INFERENCE PHASE (Fast) ────────────────────────────────────────┐   β”‚
β”‚  β”‚                                                                   β”‚   β”‚
β”‚  β”‚  User Query        LanceDB Cache                                 β”‚   β”‚
β”‚  β”‚      β”‚                  β”‚                                        β”‚   β”‚
β”‚  β”‚      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                                        β”‚   β”‚
β”‚  β”‚                 β–Ό                                                β”‚   β”‚
β”‚  β”‚        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                 β”‚   β”‚
β”‚  β”‚        β”‚  LLM + LanceDB Hits  β”‚  ✨ LOCAL RETRIEVAL!            β”‚   β”‚
β”‚  β”‚        β”‚  (Context + cache)   β”‚  ✨ LOW LATENCY!               β”‚   β”‚
β”‚  β”‚        β”‚                      β”‚  ✨ GUARANTEED CONTEXT!        β”‚   β”‚
β”‚  β”‚        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                 β”‚   β”‚
β”‚  β”‚                   β”‚                                              β”‚   β”‚
β”‚  β”‚                   β–Ό                                              β”‚   β”‚
β”‚  β”‚              Answer (Instant)                                    β”‚   β”‚
β”‚  β”‚                                                                  β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                           β”‚
β”‚  β”Œβ”€β”€β”€ MULTI-TURN OPTIMIZATION ───────────────────────────────────────┐   β”‚
β”‚  β”‚                                                                   β”‚   β”‚
β”‚  β”‚  For next query: Simply truncate and reuse cached knowledge     β”‚   β”‚
β”‚  β”‚  (No need to reprocess documents)                              β”‚   β”‚
β”‚  β”‚                                                                 β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Workflow Phases

1. Preload Phase (One-time setup)

  • All relevant documents are loaded into the LLM's extended context window
  • The model processes the entire knowledge base at once

2. Cache Phase (Offline computation)

  • The model's key-value (KV) cache is precomputed and stored
  • This cache encapsulates the inference state of the LLM with all knowledge
  • No additional computation needed for each query

3. Inference Phase (Fast queries)

  • User queries are appended to the preloaded context
  • The model uses the cached parameters to generate responses directly
  • No retrieval step needed β†’ Instant answers

4. Reset Phase (Multi-turn optimization)

  • For new queries, the cache is efficiently truncated and reused
  • The preloaded knowledge remains available without reprocessing (see the sketch below)
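
A concrete, simplified version of these phases, using langchain-ollama against the default Qwen3-14B model: all documents are folded into one fixed system message (preload), and every question is appended to that unchanged prefix (inference), which lets the Ollama server reuse its KV cache instead of reprocessing the documents. This mirrors the flow described above, not CagVault's actual internals.

from langchain_core.messages import HumanMessage, SystemMessage
from langchain_ollama import ChatOllama

llm = ChatOllama(model="hf.co/unsloth/Qwen3-14B-GGUF:Q4_K_XL", temperature=0.0)

# Preload phase: the whole knowledge base becomes one fixed context prefix.
def preload(documents: list[str]) -> SystemMessage:
    knowledge = "\n\n".join(documents)
    return SystemMessage(content=f"Answer only from this knowledge base:\n{knowledge}")

# Inference phase: append each question to the unchanged prefix and generate.
def ask(context: SystemMessage, question: str) -> str:
    return llm.invoke([context, HumanMessage(content=question)]).content

context = preload(["<full text of your credit agreement goes here>"])
print(ask(context, "Who are the parties to the agreement?"))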

Advantages

  • βœ… Zero Retrieval Latency: No real-time document search
  • βœ… Unified Context: Holistic understanding of all documents
  • βœ… Simplified Architecture: Single model, no retriever integration
  • βœ… Eliminates Retrieval Errors: All relevant information is guaranteed to be available
  • βœ… Perfect for Constrained Knowledge Bases: Ideal when all documents fit in context window

πŸ—οΈ Architecture Overview

CagVault runs as a local agentic stack that combines a Streamlit UI, Claude Agent SDK tools, and LanceDB-backed storage.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                              Browser                               β”‚
β”‚                   Streamlit UI (app.py)                            β”‚
β”‚  - Chat with reasoning trace and skill tags                        β”‚
β”‚  - Upload/parse PDFs and manage caches                             β”‚
β”‚  - Question library + sections/entities explorer                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                β”‚ questions, uploads, actions
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                           Agent Brain                              β”‚
β”‚  Router: question classifier + skill inference                     β”‚
β”‚  Planner: chooses cached answer, retrieval, or tool use            β”‚
β”‚  Reasoner: Claude/Ollama models with reflection                    β”‚
β”‚  Tools (Claude Agent SDK via MCP):                                 β”‚
β”‚    β€’ web_search β€’ entity_extractor β€’ section_ranker                β”‚
β”‚    β€’ cross_doc_links β€’ fact_verifier β€’ followup_suggester          β”‚
β”‚  Skills: PDF parser, TOC/NER search, credit analyst prompts,       β”‚
β”‚          knowledge-base skill registry                             β”‚
β”‚  Caches: Q&A cache (LanceDB), question library (LanceDB),          β”‚
β”‚          in-memory DataFrame cache                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                β”‚ retrieval + storage calls
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      Storage and Engines                           β”‚
β”‚  LanceDB (embedded): doc_sections, qa_cache, question_library      β”‚
β”‚  Search: full-text, semantic, agentic rerank, entity filters       β”‚
β”‚  Runtimes: Ollama models, CAG MCP server hosting the tools         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Flows:

  • Upload/Parse β†’ LanceDB: PDFs run through Docling + LLM section analysis, saved to doc_sections with entities and TOC metadata.
  • Ask β†’ Router β†’ Cache (default mode): Questions first check LanceDB Q&A cache/question library before invoking the LLM.
  • Retrieval/Tools: When needed, the agent retrieves sections from LanceDB or calls MCP tools (web, entity, ranking, cross-doc, verification, follow-ups).
  • Answering: Responses stream with reasoning trace, cited sections, and the skills/tools used for transparency.
  • Persistence: All storage is local (LanceDB + optional caches); no cloud services are required.

Execution Modes:

  • Default (LanceDB Chat): Uses LanceDB retrieval plus Q&A cache and question library for fast local answers. No MCP tools or multi-step agent planning are invoked.
  • Agentic RAG Mode (toggle in UI): Adds planning, strategy selection, and MCP tools (web search, entities, ranking, cross-doc, fact check, follow-ups). This path currently bypasses the LanceDB Q&A cache for answers.

Knowledge Base Skills:

  • Skills live locally in knowledge-base/ and are inferred by lightweight keyword heuristics. They are rendered with each answer for transparency and kept private on-disk (see .gitignore). A toy sketch of the keyword heuristic follows below.
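
A minimal sketch of that kind of keyword heuristic, with made-up skill names and trigger words (the real skills and their keyword lists live in knowledge-base/):

# Toy keyword-based skill inference; skill names and keywords are illustrative.
SKILL_KEYWORDS = {
    "covenant_analysis": {"covenant", "leverage ratio", "basket"},
    "defaults_review": {"event of default", "acceleration", "cure period"},
}

def infer_skills(question: str) -> list[str]:
    q = question.lower()
    return [skill for skill, words in SKILL_KEYWORDS.items()
            if any(word in q for word in words)]

print(infer_skills("Which events of default allow acceleration?"))
# -> ['defaults_review']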

Core Features

πŸ”’ Privacy & Security

  • Fully Local & Private: No API keys, cloud services, or internet connection required for core use (optional extras such as Groq or the Whisper API are the only cloud features)
  • Document Control: All processing happens on your machine
  • Embedded Storage: LanceDB runs in-process, so no external database server is needed

πŸ“„ Intelligent Document Processing

  • Enhanced PDF Parsing: Using Docling with LLM-powered section analysis
  • Multi-Format Support: PDF, TXT, MD files and web URLs
  • Hierarchical Structure: Automatic detection of sections, subsections, and tables
  • Named Entity Recognition: Extract parties, dates, monetary amounts, and legal terms
  • Page-Accurate Tracking: Precise page ranges for every section

πŸ” Advanced Search Capabilities

  • Keyword Search: Fast full-text search across all sections
  • Semantic Search: AI-powered similarity matching
  • Agentic Search: Claude-driven intelligent query understanding with reasoning
  • Entity Filtering: Search by PARTY, DATE, MONEY, AGREEMENT, or PERCENTAGE

πŸ’Ύ Intelligent Caching System

  • Q&A Cache: LanceDB-backed answer caching with automatic deduplication (cache-key sketch after this list)
  • Question Library: Track popular questions organized by 15+ categories
  • KV-Cache Optimization: 10-40x faster multi-turn conversations
  • Cache Analytics: Real-time statistics and granular cache management
  • Document-Specific Caching: Per-document cache with TTL management
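
The Q&A cache is keyed by a content hash over the question and the documents in scope (the architecture diagram later in this README describes it as sha256(question + doc_ids)). Below is a minimal sketch of that keying scheme; the helper name and the exact normalization are assumptions:

import hashlib

def qa_cache_key(question: str, doc_ids: list[str]) -> str:
    # Same question asked against the same documents -> same cached answer.
    payload = question.strip().lower() + "|" + ",".join(sorted(doc_ids))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key = qa_cache_key("What is the maturity date?", ["credit_agreement_2024"])
print(key)  # lookup key into the LanceDB qa_cache table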

πŸ’¬ Enhanced Chat Experience

  • Streaming Responses: Real-time generation with thinking process visibility
  • Referenced Sections: Auto-expand cited sections in answers
  • Suggested Questions: Category-based question recommendations
  • Autocomplete Search: Type-ahead suggestions from question library
  • Multi-Document Context: Chat across multiple documents simultaneously

🎀 Voice Features (Optional)

  • Speech-to-Text (STT): Record questions via microphone and transcribe with Whisper (locally via faster-whisper/openai-whisper, or via the OpenAI API)
  • Text-to-Speech (TTS): Synthesize answers to audio using pyttsx3 (local synthesis)
  • Voice Input: Ask questions hands-free, ideal for multitasking
  • Voice Output: Listen to answers while reviewing documents
  • Configurable Settings: Adjust recording duration, speech rate, and volume
  • Privacy: Local TTS synthesis; Whisper STT can run fully locally, use the OpenAI API, or be disabled entirely

🎯 Credit Agreement Analysis

  • Section Classification: Automatic identification of COVENANTS, DEFAULTS, DEFINITIONS, etc.
  • Importance Scoring: AI-driven relevance analysis for credit analysts
  • Cross-Reference Tracking: Detect dependencies between sections
  • Covenant Analysis: Specialized understanding of debt agreements and financial covenants

🧠 Extended Context & Performance

  • Large Context Windows: Leverages Qwen3-14B's 8K+ token capacity
  • Concurrent Request Handling: 4 parallel LLM workers for simultaneous requests
  • Parallel Processing: Concurrent LLM calls for faster document analysis (4 workers)
  • Smart Page Estimation: Word-based calculation for instant section mapping
  • Memory Management: In-memory section store with LanceDB persistence
  • Connection Pooling: Optimized Ollama connections with timeout management

πŸ“Š Data Lineage Tracking (New!)

The system now includes OpenLineage-compliant data lineage tracking to monitor your document processing pipeline:

Access the Dashboard

Click the "πŸ“Š Lineage" button in the top-right corner of the app, or navigate to:

http://localhost:8501/lineage_dashboard

Dashboard Features

Monitor end-to-end data flow from document ingestion β†’ embedding β†’ retrieval β†’ LLM response:

Dashboard Views:

  1. Overview (default): Total events, assets, success rate, and operation breakdown with visualizations
  2. Events Timeline: Chronological event log with filtering by operation type
  3. Asset Lineage: Trace complete data flow for any document or section
  4. Performance Analysis: Duration distribution, trends over time, and slowest operations

Tracked Operations

The system automatically tracks:

  • πŸ“„ ingest: Document ingestion (PDF files with metadata)
  • πŸ“‘ extract_section: Section extraction from documents
  • πŸ”’ embed: Embedding generation from sections (1024 dimensions)
  • πŸ—„οΈ store_lancedb: Storage in LanceDB vector database
  • πŸ” retrieve: Retrieval from LanceDB (with cache hit tracking)
  • πŸ€– llm_response: LLM-generated answers with model info

Key Metrics

  • ⏱️ Operation duration (milliseconds)
  • πŸ“Š Total events and unique assets
  • βœ… Success rate and status breakdown
  • πŸ“ˆ Operation counts and average durations
  • πŸ”— Complete data flow lineage for any asset

How to Use

  1. Upload a Document: When you upload a PDF, all processing steps are automatically tracked
  2. Ask Questions: Retrievals and LLM responses are tracked in real-time
  3. View Dashboard: Click "πŸ“Š Lineage" button to see live metrics and visualizations
  4. Analyze Performance: Check the Performance Analysis view for bottlenecks

Technical Details

  • Storage: SQLite database at .cache/lineage.db (local, private)
  • Overhead: ~5-10ms per tracked operation (minimal impact)
  • Standard: OpenLineage-compliant metadata format
  • Dependencies: No extra services (uses Python's built-in SQLite plus Plotly for charts)
  • No External Services: All data stored locally

See documentation/DATA_LINEAGE_GUIDE.md for detailed API documentation and documentation/LINEAGE_IMPLEMENTATION.md for implementation details.
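
Because the lineage store is a plain SQLite file, you can also peek at it with nothing but the Python standard library. The snippet below only lists whatever tables the tracker has created, so it makes no assumptions about the schema:

import sqlite3

conn = sqlite3.connect(".cache/lineage.db")
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
for name in tables:
    count = conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]
    print(f"{name}: {count} rows")
conn.close()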

Prerequisites

Required

  • macOS (or Linux/Windows with appropriate package managers)
  • Python 3.12.x or 3.14.x
  • Homebrew (for macOS)
  • At least 10GB free disk space (for the LLM model)
  • 16GB RAM recommended (8GB minimum for 7B models)

Data Storage

  • LanceDB (included with dependencies)
    • Embedded vector database for persistent storage
    • No separate installation or server required
    • Automatically stores: Q&A cache, question library, parsed document sections
    • Database location: ./lancedb directory
    • Migration tool available for existing Redis users (see below)

Installation

1. Clone the Repository

git clone https://github.com/letslego/cagvault.git
cd cagvault

2. Set Up Python Environment

Create a Python 3.12 virtual environment:

python3.12 -m venv .venv312
source .venv312/bin/activate

3. Install Dependencies

Install all required Python packages:

pip install -e .

This will install:

  • streamlit - Web UI framework
  • langchain-core, langchain-ollama, langchain-groq, langchain-community - LLM orchestration
  • docling - Document conversion library
  • Other dependencies (see pyproject.toml)

4. Install and Start Ollama

Ollama is a local LLM inference server that runs models entirely on your machine.

macOS Installation:

brew install ollama
brew services start ollama

Verify Ollama is running:

ollama list

Linux Installation:

curl -fsSL https://ollama.com/install.sh | sh
ollama serve &

Windows Installation:

Download and run the installer from ollama.com/download

5. Download LLM Models

CagVault supports multiple high-performance models optimized for RAG and document understanding. Choose based on your hardware and performance needs.

Quick Start: Download Essential Models (Recommended)

Download 3 essential models covering all use cases (~30GB):

./download_essential_models.sh

This installs:

  • Qwen3-14B (Default) - Best balance of quality and speed
  • Llama 3.1 8B (Lightweight) - Fast responses, low memory
  • Phi-4 (Efficient) - Microsoft's optimized model

Download All Models

To download all 10 supported models (~200GB):

./download_models.sh

⚠️ Warning: This downloads 200GB+ and takes 2-4 hours

Manual Model Selection

Alternatively, download individual models:

Recommended Models for RAG/Document Analysis

DeepSeek V3 (Recommended for Best Quality) - 685B parameters, state-of-the-art reasoning:

ollama pull deepseek-ai/DeepSeek-V3

Requires: 64GB+ RAM, Apple Silicon M3 Max or similar

Qwen3-14B (Default) - Excellent balance of quality and speed:

ollama pull hf.co/unsloth/Qwen3-14B-GGUF:Q4_K_XL

Requires: 16GB+ RAM

DeepSeek R1 - Advanced reasoning for complex credit agreement queries:

ollama pull deepseek-ai/DeepSeek-R1

Requires: 32GB+ RAM

Mistral Large - Excellent long-context performance:

ollama pull mistral-large-latest

Requires: 32GB+ RAM

Command R+ - Cohere's RAG-optimized model:

ollama pull command-r-plus:latest

Requires: 32GB+ RAM

Lightweight Models (8GB-16GB RAM)

Phi-4 - Microsoft's efficient 14B model:

ollama pull phi4:latest

Llama 3.1 8B - Fast and lightweight:

ollama pull llama3.1:8b

Mistral Small - Quick responses for simpler queries:

ollama pull mistral-small-latest

High-End Models (32GB+ RAM)

Llama 3.3 70B - Strong reasoning:

ollama pull llama3.3:70b

Gemma 2 27B - Google's reasoning model:

ollama pull gemma2:27b

Switching Models

Option 1: Use the UI (Recommended)

  1. Start the app: streamlit run app.py
  2. Open the sidebar
  3. Expand "πŸ€– Model Settings"
  4. Select your preferred model from the dropdown
  5. Click "πŸ”„ Restart App" to apply

The UI shows RAM requirements and speed for each model to help you choose.

Option 2: Edit Config File

Edit config.py directly:

class Config:
    MODEL = DEEPSEEK_V3  # Change from QWEN_3 to any model above
    OLLAMA_CONTEXT_WINDOW = 8192

Available model constants:

  • QWEN_3 (default) - Qwen3-14B
  • DEEPSEEK_V3 - DeepSeek V3 (best quality)
  • DEEPSEEK_R1 - DeepSeek R1 (advanced reasoning)
  • MISTRAL_LARGE - Mistral Large
  • MISTRAL_SMALL - Mistral Small
  • LLAMA_3_3_70B - Llama 3.3 70B
  • LLAMA_3_1_8B - Llama 3.1 8B
  • PHI_4 - Phi-4
  • GEMMA_2_27B - Gemma 2 27B
  • COMMAND_R_PLUS - Command R+

Browsing Available Models

# List installed models
ollama list

# Browse or search for models in the Ollama library (web):
#   https://ollama.com/library

# Pull any model
ollama pull <model-name>

6. (Optional) Migrate from Redis

If you have existing data in Redis, you can migrate it to LanceDB:

# In Python console or script
from lancedb_cache import get_lancedb_store
import redis

# Connect to your Redis instance
redis_client = redis.from_url("redis://localhost:6379/0")

# Migrate all data (documents, Q&A cache, question library)
store = get_lancedb_store()
store.migrate_from_redis(redis_client)

print("Migration complete! Redis data imported to LanceDB.")

Note: After migration, you can optionally remove Redis. LanceDB is now the default persistent storage and requires no separate server.

7. (Optional) Enable Voice Features

Voice features allow speech-to-text input and text-to-speech output.

Option A: Fully Open Source (Recommended)

Use local Whisper models - no API keys needed, 100% offline:

# Fast, optimized local Whisper (recommended)
pip install pyttsx3 sounddevice soundfile faster-whisper

# OR standard local Whisper
pip install pyttsx3 sounddevice soundfile openai-whisper

Advantages:

  • βœ… Completely free and open source
  • βœ… Works offline (no internet required)
  • βœ… No API keys or usage limits
  • βœ… Privacy-focused (data stays local)

Model sizes (faster-whisper or openai-whisper), with a short transcription sketch after the list:

  • tiny - Fastest, least accurate (~75MB)
  • base - Good balance (default, ~145MB)
  • small - Better accuracy (~466MB)
  • medium - High accuracy (~1.5GB)
  • large - Best accuracy (~3GB)
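
With faster-whisper installed, local transcription takes only a few lines. This is generic faster-whisper usage (the audio filename is a placeholder), not CagVault's voice module itself:

from faster_whisper import WhisperModel

# "base" matches the default size above; int8 keeps CPU memory use modest.
model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe("question.wav")
text = " ".join(segment.text.strip() for segment in segments)
print(f"Detected language: {info.language}")
print(f"Transcript: {text}")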

Option B: OpenAI Whisper API (Cloud)

If you prefer cloud-based STT with API:

pip install pyttsx3 sounddevice soundfile openai

# Set API key
export OPENAI_API_KEY="your-api-key-here"

Note: Both options use pyttsx3 for TTS (already open source and local).

8. Verify Installation

Check that everything is installed correctly:

# Python environment
python --version  # Should show 3.12.x or 3.14.x

# Ollama service
ollama list  # Should show your downloaded models

# Python packages
pip list | grep -E "(streamlit|langchain|docling|lancedb)"

# Optional voice features (if installed)
pip list | grep -E "(pyttsx3|sounddevice|openai)"

Running the Application

Start the Streamlit App

With your virtual environment activated:

streamlit run app.py

The application will open in your browser at http://localhost:8501

Using the Application

1. Upload Documents

Via File Upload:

  • Click the file uploader in the sidebar
  • Select PDF, TXT, or MD files
  • Watch the enhanced parsing process with section extraction
  • View parsing statistics: pages, sections, entities found

Via URL:

  • Paste a web URL in the text input
  • Click "Add Web Page" to scrape and convert to text

From LanceDB:

  • Click "πŸ—„οΈ Documents in LanceDB" expander
  • Select a previously parsed document
  • Click "Load for chat" to restore from persistent storage

2. Explore Document Structure

Sections Tab:

  • Browse hierarchical document structure
  • View page ranges, word counts, and table indicators
  • Click to expand section content
  • See coverage statistics (page distribution)

Search Tab:

  • Agentic Search: Claude-powered intelligent search with reasoning
  • Keyword Search: Fast full-text search with match counts
  • Semantic Search: AI similarity matching with relevance scores

Entities Tab:

  • Filter by type: MONEY, DATE, PARTY, AGREEMENT, PERCENTAGE
  • Click entities to see source sections
  • Track key document information

3. Ask Questions

Direct Input:

  • Type your question in the chat input at the bottom
  • Press Enter to submit

Suggested Questions:

  • Click "πŸ’‘ Suggested Questions" to see popular queries
  • Click any suggestion to instantly ask it
  • Questions are categorized: Definitions, Parties, Financial, etc.

Browse by Category:

  • Click "πŸ“š Browse by Category"
  • Explore questions organized by 15+ categories
  • View document-specific or global questions

4. Review Responses

Chat Messages:

  • Thinking Process: Expand "CAG's thoughts" to see reasoning
  • Streaming Answers: Watch responses generate in real-time
  • Cache Indicator: "πŸ’Ύ Using cached response" shows when answers are cached

Referenced Sections:

  • Automatically expands sections cited in the answer
  • Click section expanders to view full content
  • Includes page ranges and section metadata

Cache Status:

  • Green "πŸ’Ύ Using cached response" = instant retrieval from LanceDB
  • No indicator = fresh LLM generation + automatic caching to LanceDB

Voice Output (Optional):

  • Click "πŸ”Š Speak" button next to assistant responses
  • Hear synthesized answers while reviewing documents
  • Adjust speech rate and volume in sidebar "🎀 Voice Features"
  • Perfect for hands-free operation or accessibility

5. Use Voice Features (Optional)

Voice Input:

  • Click "πŸŽ™οΈ Record Question" to start recording via microphone
  • Audio is recorded locally (privacy-first)
  • Click "πŸ“ Transcribe" to convert speech to text using Whisper
  • Transcribed question is automatically submitted for analysis

Voice Output:

  • Click "πŸ”Š Speak" below any assistant answer
  • Answer text is synthesized to audio using local TTS
  • Adjust settings in sidebar (speech rate, volume)
  • Audio plays inline with playback controls

Configuration (Sidebar):

  • Expand "🎀 Voice Features" section
  • Toggle "πŸŽ™οΈ Voice Input" to enable recording
  • Toggle "πŸ”Š Voice Output" to enable synthesis
  • Adjust recording duration (5-60 seconds)
  • Adjust speech rate (50-300 words per minute)
  • Adjust output volume (0.0-1.0)

Requirements:

  • Voice Input: Requires Whisper (local faster-whisper/openai-whisper, or the OpenAI API with OPENAI_API_KEY)
  • Voice Output: Works offline with pyttsx3 (no API needed)
  • Both: Requires audio recording libraries (sounddevice, soundfile)

6. Manage Caches

Cache Stats (Sidebar):

  • View total contexts, tokens, and cache hits
  • Clear all context cache with "🧹 Clear Cache"

Q&A Cache Management:

  • View cached Q&A pairs per document
  • Browse questions with thinking and responses
  • Clear per-document cache or all Q&A cache
  • Persistent storage in LanceDB (no memory limits)

Question Library:

  • Search library with autocomplete
  • View usage counts and categories
  • Delete individual questions
  • Clear entire library

Project Structure

cagvault/
β”œβ”€β”€ app.py                          # Streamlit UI with enhanced features
β”œβ”€β”€ config.py                       # Model configuration and settings
β”œβ”€β”€ models.py                       # LLM factory (Ollama/Groq)
β”œβ”€β”€ knowledge.py                    # Document loading and conversion
β”œβ”€β”€ chatbot.py                      # Chat logic with streaming and prompts
β”œβ”€β”€ kvcache.py                      # KV-Cache manager for context caching
β”œβ”€β”€ lancedb_cache.py                # LanceDB storage layer with in-process cache
β”œβ”€β”€ qa_cache.py                     # LanceDB-backed Q&A caching system
β”œβ”€β”€ question_library.py             # Question library with categorization
β”œβ”€β”€ voice_features.py               # Speech-to-text and text-to-speech (optional)
β”œβ”€β”€ simple_cag.py                   # Simplified CAG implementation
β”œβ”€β”€ pyproject.toml                  # Python dependencies
β”œβ”€β”€ lancedb/                        # LanceDB embedded database directory
β”‚   β”œβ”€β”€ doc_sections.lance/         # Document sections table
β”‚   β”œβ”€β”€ qa_cache.lance/             # Q&A cache table
β”‚   └── question_library.lance/    # Question library table
β”œβ”€β”€ skills/
β”‚   └── pdf_parser/
β”‚       β”œβ”€β”€ pdf_parser.py           # Core PDF parsing (Docling)
β”‚       β”œβ”€β”€ enhanced_parser.py      # LLM-powered section analysis
β”‚       β”œβ”€β”€ ner_search.py           # NER and search engines
β”‚       β”œβ”€β”€ credit_analyst_prompt.py # Credit analyst classification
β”‚       └── llm_section_evaluator.py # Section importance scoring
β”œβ”€β”€ .cache/
β”‚   β”œβ”€β”€ documents/                  # Parsed document cache
β”‚   β”œβ”€β”€ kvcache/                    # KV-cache storage
β”‚   └── toc_sections/               # TOC-based section extraction
└── README.md                       # This file

Configuration

Model Selection

By default, CagVault uses Qwen3-14B locally via Ollama. To change models, edit config.py:

class Config:
    MODEL = DEEPSEEK_V3  # Change to any model constant
    OLLAMA_CONTEXT_WINDOW = 8192  # Adjust context size

Supported Providers

Currently, CagVault supports:

  • Ollama (default): Local inference, completely private, no API key needed
  • Groq (optional): Cloud inference, requires GROQ_API_KEY environment variable

Model Comparison for RAG

Model            Size    RAM Required   Context Window   Best For                                   Speed
DeepSeek V3      685B    64GB+          64K              Best overall quality, complex reasoning    Slow
DeepSeek R1      ~70B    32GB+          32K              Advanced reasoning, credit analysis        Medium
Command R+       ~104B   32GB+          128K             RAG-optimized, long documents              Medium
Mistral Large    ~123B   32GB+          128K             Long-context tasks                         Medium
Llama 3.3 70B    70B     32GB+          128K             Strong reasoning, instruction following    Medium
Gemma 2 27B      27B     16GB+          8K               Balanced reasoning                         Fast
Qwen3-14B ⭐     14B     16GB           8K               Default, excellent balance                 Fast
Phi-4            14B     16GB           16K              Efficient, Microsoft-optimized             Fast
Llama 3.1 8B     8B      8GB            128K             Lightweight, fast responses                Very Fast
Mistral Small    7B      8GB            32K              Simple queries, minimal resources          Very Fast

⭐ = Default model

Adding Custom Models

To add a new model not in the config:

# In config.py, add your model
MY_CUSTOM_MODEL = ModelConfig(
    "model-name-from-ollama",
    temperature=0.0,
    provider=ModelProvider.OLLAMA
)

# Then set it as default
class Config:
    MODEL = MY_CUSTOM_MODEL

For a full list of available models, visit ollama.com/library

Troubleshooting

Ollama Connection Error

Error: httpx.ConnectError: [Errno 61] Connection refused

Solution: Start the Ollama service:

brew services start ollama  # macOS
# or
ollama serve &  # Linux

Model Not Found

Error: ollama.ResponseError: model 'xyz' not found

Solution: Pull the model first:

ollama pull hf.co/unsloth/Qwen3-14B-GGUF:Q4_K_XL

Or use a different available model:

ollama pull llama3.1:8b

Python Version Issues

Error: Pydantic warnings or import errors

Solution: Ensure you're using Python 3.12:

python --version
# If not 3.12, recreate the virtual environment with Python 3.12
python3.12 -m venv .venv312
source .venv312/bin/activate
pip install -e .

Out of Memory

If the model runs out of memory during inference:

  • Use a smaller model (e.g., llama3.1:8b instead of Qwen3-14B)
  • Reduce OLLAMA_CONTEXT_WINDOW in config.py
  • Reduce OLLAMA_NUM_PARALLEL in config.py (try 2 instead of 4)
  • Close other applications
  • Increase system swap space

Slow or Hanging Requests

If requests are timing out or hanging:

Check concurrent load:

# Monitor Ollama connections
lsof -i :11434 | wc -l  # Count active connections

Solutions:

  • Increase REQUEST_TIMEOUT in config.py for complex queries
  • Reduce OLLAMA_NUM_PARALLEL if system is overloaded
  • Check the Ollama server logs or the system console
  • Restart Ollama: brew services restart ollama

Optimal settings by RAM (example config snippet below):

  • 8-16GB RAM: OLLAMA_NUM_PARALLEL = 2
  • 16-32GB RAM: OLLAMA_NUM_PARALLEL = 4 (default)
  • 32GB+ RAM: OLLAMA_NUM_PARALLEL = 6-8
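
For example, a 16GB machine tuned for stability might use values like these in config.py (attribute names follow the settings referenced in this guide; the exact defaults in your copy of config.py may differ):

class Config:
    MODEL = QWEN_3                  # default model constant
    OLLAMA_CONTEXT_WINDOW = 8192    # shrink if you hit out-of-memory errors
    OLLAMA_NUM_PARALLEL = 2         # 4 is the default; 2 is safer on 8-16GB RAM
    REQUEST_TIMEOUT = 300           # seconds; raise for long, complex queries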

See CONCURRENT_REQUESTS.md for detailed tuning guide.

LanceDB Storage Issues

Error: Database connection or table access issues

Solution:

# Check LanceDB directory permissions
ls -la ./lancedb

# If corrupted, remove and restart (will lose cached data)
rm -rf ./lancedb
streamlit run app.py  # Tables will be recreated

# To inspect LanceDB contents
python -c "import lancedb; db = lancedb.connect('./lancedb'); print(db.table_names())"

Note: LanceDB is embedded and requires no separate server. All data is stored locally in the ./lancedb directory.

KV-Cache Issues

If cache seems corrupted or causes issues:

# Clear the KV cache
rm -rf .cache/kvcache/

# Or clear via the UI
# Click "🧹 Clear Cache" in the sidebar

Q&A Cache Issues

If cached answers seem outdated or incorrect:

# Clear Q&A cache via UI:
# 1. Expand "πŸ’Ύ Q&A Cache Management" in sidebar
# 2. Click "πŸ—‘οΈ Clear All Q&A Cache"

# Or clear LanceDB cache programmatically:
python -c "from qa_cache import get_qa_cache; get_qa_cache().clear_all_cache()"

# Or remove the entire QA table:
python -c "import lancedb; db = lancedb.connect('./lancedb'); db.drop_table('qa_cache')"

Duplicate Sections / Looping

If you see repeated sections in the UI or logs:

Cause: Document loaded multiple times without clearing memory

Solution: This should be automatically prevented by the deduplication guards. If it still occurs:

# Restart the app (clears in-memory state)
pkill -f streamlit
streamlit run app.py

# Or clear LanceDB document cache
python -c "import lancedb; db = lancedb.connect('./lancedb'); db.drop_table('doc_sections')"

Section References Not Appearing

If cited sections don't auto-expand in chat:

Check:

  1. LLM is citing sections by number (e.g., "Section 5.12.2") or title
  2. Document has been parsed with enhanced parser (not URL-only)
  3. Section titles match citation format

Debug: Check the logs for "Referenced sections" or "No section titles detected"
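
To check whether an answer contains anything the matcher can latch onto, a simplified version of the numeric-reference detection looks like this (the real matcher also handles plain titles, prefix variants, and subsections):

import re

# Simplified detector for numeric references such as "Section 5.12.2".
SECTION_REF = re.compile(r"\bSection\s+(\d+(?:\.\d+)*)", re.IGNORECASE)

answer = "Prepayments are governed by Section 2.05 and Section 5.12.2."
print(SECTION_REF.findall(answer))  # ['2.05', '5.12.2']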

Performance Considerations

Based on the CAG paper's experiments:

  • Small contexts (3-16 docs, ~21k tokens): CAG provides 10x+ speedup over dynamic context loading
  • Medium contexts (4-32 docs, ~32-43k tokens): CAG offers 17x+ speedup
  • Large contexts (7-64 docs, ~50-85k tokens): CAG achieves 40x+ speedup

The precomputed KV cache eliminates the need to reprocess documents for each query, making multi-turn conversations dramatically faster.

Technical Details

How CAG Works in This Application

  1. Document Upload: User uploads files or provides URLs
  2. Conversion: Docling converts documents to plain text
  3. Context Preloading: Documents are concatenated and passed to the LLM
  4. KV Cache: Ollama automatically caches the model's inference state (handled internally)
  5. Query Processing: User questions are appended to the cached context
  6. Streaming Response: The model generates answers using the preloaded knowledge (streaming sketch below)
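
A stripped-down version of step 6, separating the model's <think> reasoning from the visible answer while streaming with langchain-ollama (the tag handling here is deliberately simpler than the app's parser):

from langchain_ollama import ChatOllama

llm = ChatOllama(model="hf.co/unsloth/Qwen3-14B-GGUF:Q4_K_XL", temperature=0.0)

thinking, answer, in_think = [], [], False
for chunk in llm.stream("Summarize the change-of-control provisions."):
    piece = chunk.content
    # Qwen3 wraps its reasoning in <think>...</think>; route it separately.
    if "<think>" in piece:
        in_think = True
        piece = piece.replace("<think>", "")
    if "</think>" in piece:
        in_think = False
        piece = piece.replace("</think>", "")
    (thinking if in_think else answer).append(piece)

print("Reasoning:", "".join(thinking))
print("Answer:", "".join(answer))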

Current Architecture (December 2025)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                           CAGVAULT ARCHITECTURE                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  DOCUMENT INGESTION PIPELINE                                                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                             β”‚
β”‚  User Documents (PDF/TXT/MD/URL)                                           β”‚
β”‚           β”‚                                                                 β”‚
β”‚           β–Ό                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                                     β”‚
β”‚  β”‚  Docling Parser   β”‚  ← Converts PDFs with layout preservation          β”‚
β”‚  β”‚  (skills/pdf_*)   β”‚  ← OCR support (optional)                          β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  ← Table detection                                 β”‚
β”‚           β”‚                                                                 β”‚
β”‚           β–Ό                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                        β”‚
β”‚  β”‚  Enhanced Parser (LLM-Powered Analysis)       β”‚                        β”‚
β”‚  β”‚                                                β”‚                        β”‚
β”‚  β”‚  β€’ Hierarchical section extraction             β”‚                        β”‚
β”‚  β”‚  β€’ Parallel LLM importance scoring (4 workers) β”‚                        β”‚
β”‚  β”‚  β€’ Credit analyst classification               β”‚                        β”‚
β”‚  β”‚  β€’ Page-accurate tracking (word-based)         β”‚                        β”‚
β”‚  β”‚  β€’ Named Entity Recognition (NER)              β”‚                        β”‚
β”‚  β”‚  β€’ Cross-reference detection                   β”‚                        β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                        β”‚
β”‚           β”‚                                                                 β”‚
β”‚           β–Ό                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                        β”‚
β”‚  β”‚  SectionMemoryStore (In-Memory)               β”‚                        β”‚
β”‚  β”‚                                                β”‚                        β”‚
β”‚  β”‚  β€’ Hierarchical document structure             β”‚                        β”‚
β”‚  β”‚  β€’ Section β†’ Subsection relationships          β”‚                        β”‚
β”‚  β”‚  β€’ Metadata indexing (pages, importance, type) β”‚                        β”‚
β”‚  β”‚  β€’ Deduplication prevention                    β”‚                        β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                        β”‚
β”‚           β”‚                                                                 β”‚
β”‚           β–Ό                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                        β”‚
β”‚  β”‚  LanceDB Persistent Storage (Embedded)        β”‚                        β”‚
β”‚  β”‚                                                β”‚                        β”‚
β”‚  β”‚  Table: doc_sections                          β”‚                        β”‚
β”‚  β”‚  β€’ Hierarchical sections (parent_id, order)   β”‚                        β”‚
β”‚  β”‚  β€’ Full-text search indexes (content, title)  β”‚                        β”‚
β”‚  β”‚  β€’ Pre-computed keywords & entities           β”‚                        β”‚
β”‚  β”‚  β€’ Document metadata (pages, type, size)      β”‚                        β”‚
β”‚  β”‚                                                β”‚                        β”‚
β”‚  β”‚  In-Process Cache: 3s TTL DataFrame           β”‚                        β”‚
β”‚  β”‚  β€’ Sub-millisecond reads for frequent access  β”‚                        β”‚
β”‚  β”‚  β€’ Thread-safe with automatic invalidation    β”‚                        β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                        β”‚
β”‚                                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  SEARCH & RETRIEVAL LAYER                                                   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ Keyword Search   β”‚  β”‚ Semantic Search   β”‚  β”‚ Agentic Search     β”‚    β”‚
β”‚  β”‚ (FullTextSearch) β”‚  β”‚ (Embedding-based) β”‚  β”‚ (Claude-powered)   β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚           β”‚                     β”‚                        β”‚                β”‚
β”‚           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                β”‚
β”‚                                 β”‚                                          β”‚
β”‚                                 β–Ό                                          β”‚
β”‚                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                             β”‚
β”‚                    β”‚  Search Results        β”‚                             β”‚
β”‚                    β”‚  + Relevance Scores    β”‚                             β”‚
β”‚                    β”‚  + Reasoning (Agentic) β”‚                             β”‚
β”‚                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                             β”‚
β”‚                                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  CHAT & Q&A LAYER                                                           β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                             β”‚
β”‚  User Question                                                              β”‚
β”‚      β”‚                                                                      β”‚
β”‚      β–Ό                                                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                              β”‚
β”‚  β”‚  Question Library (LanceDB)             β”‚                              β”‚
β”‚  β”‚                                          β”‚                              β”‚
β”‚  β”‚  Table: question_library                β”‚                              β”‚
β”‚  β”‚  β€’ 15+ categories (Definitions, etc.)   β”‚                              β”‚
β”‚  β”‚  β€’ Usage tracking & popularity          β”‚                              β”‚
β”‚  β”‚  β€’ Autocomplete suggestions (FTS)       β”‚                              β”‚
β”‚  β”‚  β€’ Per-document & global questions      β”‚                              β”‚
β”‚  β”‚  β€’ In-process cache (3s TTL)            β”‚                              β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                              β”‚
β”‚           β”‚                                                                 β”‚
β”‚           β–Ό                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                              β”‚
β”‚  β”‚  Q&A Cache (LanceDB)                    β”‚                              β”‚
β”‚  β”‚                                          β”‚                              β”‚
β”‚  β”‚  Table: qa_cache                        β”‚                              β”‚
β”‚  β”‚  Key: sha256(question + doc_ids)        β”‚                              β”‚
β”‚  β”‚  Value: {response, thinking, metadata}  β”‚                              β”‚
β”‚  β”‚  FTS Index: question field              β”‚                              β”‚
β”‚  β”‚                                          β”‚                              β”‚
β”‚  β”‚  Cache Hit? β†’ Return cached response βœ“  β”‚                              β”‚
β”‚  β”‚  Cache Miss? β†’ Continue to LLM ↓        β”‚                              β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                              β”‚
β”‚           β”‚                                                                 β”‚
β”‚           β–Ό                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                              β”‚
β”‚  β”‚  Context Builder                        β”‚                              β”‚
β”‚  β”‚                                          β”‚                              β”‚
β”‚  β”‚  β€’ Load full document content           β”‚                              β”‚
β”‚  β”‚  β€’ Build hierarchical context           β”‚                              β”‚
β”‚  β”‚  β€’ Include section metadata             β”‚                              β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                              β”‚
β”‚           β”‚                                                                 β”‚
β”‚           β–Ό                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                              β”‚
β”‚  β”‚  KV-Cache Manager                       β”‚                              β”‚
β”‚  β”‚                                          β”‚                              β”‚
β”‚  β”‚  β€’ Precompute context state             β”‚                              β”‚
β”‚  β”‚  β€’ Track token counts                   β”‚                              β”‚
β”‚  β”‚  β€’ Deduplicate sources                  β”‚                              β”‚
β”‚  β”‚  β€’ Persistent disk storage              β”‚                              β”‚
β”‚  β”‚                                          β”‚                              β”‚
β”‚  β”‚  10-40x speedup for multi-turn chat!    β”‚                              β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                              β”‚
β”‚           β”‚                                                                 β”‚
β”‚           β–Ό                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                              β”‚
β”‚  β”‚  Ollama LLM Server (4 Parallel Workers)β”‚                              β”‚
β”‚  β”‚                                          β”‚                              β”‚
β”‚  β”‚  Model: Qwen3-14B (Q4_K_XL quantized)   β”‚                              β”‚
β”‚  β”‚  Context: 8K+ tokens                    β”‚                              β”‚
β”‚  β”‚  Temperature: 0.0 (deterministic)       β”‚                              β”‚
β”‚  β”‚                                          β”‚                              β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”‚                              β”‚
β”‚  β”‚  β”‚ System Prompt              β”‚         β”‚                              β”‚
β”‚  β”‚  β”‚ β€’ Credit analyst expertise β”‚         β”‚                              β”‚
β”‚  β”‚  β”‚ β€’ Cross-reference checking β”‚         β”‚                              β”‚
β”‚  β”‚  β”‚ β€’ Citation requirements    β”‚         β”‚                              β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β”‚                              β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                              β”‚
β”‚           β”‚                                                                 β”‚
β”‚           β–Ό                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                              β”‚
β”‚  β”‚  Response Stream                        β”‚                              β”‚
β”‚  β”‚                                          β”‚                              β”‚
β”‚  β”‚  <think>...</think> β†’ Reasoning         β”‚                              β”‚
β”‚  β”‚  Answer β†’ Final response                β”‚                              β”‚
β”‚  β”‚                                          β”‚                              β”‚
β”‚  β”‚  β€’ Auto-cache to LanceDB                β”‚                              β”‚
β”‚  β”‚  β€’ Extract section references           β”‚                              β”‚
β”‚  β”‚  β€’ Track to question library            β”‚                              β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                              β”‚
β”‚           β”‚                                                                 β”‚
β”‚           β–Ό                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                              β”‚
β”‚  β”‚  Referenced Section Matcher             β”‚                              β”‚
β”‚  β”‚                                          β”‚                              β”‚
β”‚  β”‚  β€’ Regex-based title matching           β”‚                              β”‚
β”‚  β”‚  β€’ Numeric prefix detection (5.12.2)    β”‚                              β”‚
β”‚  β”‚  β€’ Section/Β§ prefix variants            β”‚                              β”‚
β”‚  β”‚  β€’ Case-insensitive matching            β”‚                              β”‚
β”‚  β”‚  β€’ Subsection inclusion                 β”‚                              β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                              β”‚
β”‚           β”‚                                                                 β”‚
β”‚           β–Ό                                                                 β”‚
β”‚  Streamlit UI Display:                                                     β”‚
β”‚  β€’ Chat messages                                                           β”‚
β”‚  β€’ Expandable thinking blocks                                              β”‚
β”‚  β€’ Referenced section expanders with full content                          β”‚
β”‚  β€’ Cache status indicators                                                 β”‚
β”‚                                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
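
For illustration, the reference matcher in the final step above boils down to a regex pass over the LLM answer plus a lookup against known section titles. The pattern and helper below are a sketch of that idea, not the repo's exact code:

# Illustrative sketch of regex-based section reference matching (not the exact
# implementation): find "Section 5.12.2", "Β§ 5.12", or bare numeric prefixes in
# an answer and map them to known section IDs.
import re

# Optional "Section"/"Β§" prefix, then a dotted numeric reference like 5.12.2.
SECTION_REF = re.compile(r"(?:Section|Β§)?\s*(\d+(?:\.\d+)+)", re.IGNORECASE)

def find_referenced_sections(answer: str, sections: dict[str, str]) -> list[str]:
    """Map numeric references or title mentions in the answer to known section IDs."""
    refs = {m.group(1) for m in SECTION_REF.finditer(answer)}
    answer_lower = answer.lower()
    matched = []
    for section_id, title in sections.items():
        prefix = title.split(" ", 1)[0].rstrip(".")   # e.g. "5.12.2" from "5.12.2 Restricted Payments"
        if prefix in refs or title.lower() in answer_lower:   # case-insensitive title match
            matched.append(section_id)
    return matched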

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  DATA FLOW SUMMARY                                                          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                             β”‚
β”‚  1. UPLOAD: PDF β†’ Docling β†’ Enhanced Parser β†’ Section Analysis (parallel)  β”‚
β”‚  2. STORE:  Sections β†’ Memory + LanceDB persistence                        β”‚
β”‚  3. INDEX:  Keywords + Entities + Semantic embeddings                      β”‚
β”‚  4. QUERY:  Question β†’ Library + Q&A Cache check                           β”‚
β”‚  5. SEARCH: Keyword/Semantic/Agentic β†’ Relevant sections                   β”‚
β”‚  6. BUILD:  Context from sections β†’ KV-Cache                               β”‚
β”‚  7. INFER:  LLM with cached context β†’ Streamed response                    β”‚
β”‚  8. MATCH:  Extract section refs β†’ Auto-expand in UI                       β”‚
β”‚  9. CACHE:  Store Q&A + Update library + Track usage                       β”‚
β”‚                                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Components

Core Infrastructure

  • Ollama: Local LLM inference server (Qwen3-14B)
  • LanceDB: Embedded vector database for persistent storage (Q&A cache, sections, questions)
  • Streamlit: Interactive web UI with real-time updates
  • LangChain: LLM orchestration and streaming

Document Processing

  • Docling (skills/pdf_parser/pdf_parser.py): PDF/HTML/TXT/MD conversion with layout preservation
  • EnhancedPDFParserSkill (skills/pdf_parser/enhanced_parser.py):
    • LLM-powered section extraction
    • Parallel importance scoring (ThreadPoolExecutor)
    • Hierarchical structure with page tracking
    • LanceDB persistence with deduplication guards
  • SectionMemoryStore: In-memory hierarchical document structure
  • NamedEntityRecognizer (skills/pdf_parser/ner_search.py): Extract and index entities

Search & Retrieval

  • FullTextSearchEngine: Fast keyword search with tokenization
  • Semantic Search: Embedding-based similarity matching
  • Agentic Search: Claude-powered intelligent query understanding

Caching System

  • KVCacheManager (kvcache.py): Context state caching with disk persistence
  • QACacheManager (qa_cache.py): LanceDB-backed Q&A caching with persistent storage
  • QuestionLibraryManager (question_library.py): Question tracking with categorization and usage analytics
  • LanceDBStore (lancedb_cache.py): Unified storage layer with in-process DataFrame cache (3s TTL)

Credit Analysis

  • CreditAnalystPrompt (skills/pdf_parser/credit_analyst_prompt.py): Section classification and importance
  • LLMSectionEvaluator (skills/pdf_parser/llm_section_evaluator.py): Batch analysis with parallel processing

LanceDB Storage Architecture

Unified Storage Layer (lancedb_cache.py):

  • Embedded Vector Database: No external server required, all data in ./lancedb directory
  • Three Main Tables:
    1. doc_sections: Hierarchical document sections with full-text search
    2. qa_cache: Question-answer pairs with thinking and metadata
    3. question_library: Popular questions with usage tracking and categorization

Schema Design:

# doc_sections table
document_id: string          # Unique document identifier
document_name: string        # Human-readable name
section_id: string          # Section unique ID
parent_id: string           # Parent section for hierarchy
level: int32                # Nesting level (1, 2, 3...)
order_idx: int32            # Preservation of document order
title: string               # Section title
content: string             # Section text content
keywords: list<string>      # Pre-computed search tokens
entities_json: string       # NER results (JSON)
metadata_json: string       # Section metadata
total_pages: int32          # Document page count
extraction_method: string   # Parser version/method
source: string              # Origin (upload, URL, etc.)
stored_at: string           # Timestamp (ISO 8601)

# qa_cache table
cache_key: string           # SHA256 hash of question + doc_ids
question: string            # Original question
response: string            # LLM answer
thinking: string            # Reasoning process
doc_ids: list<string>       # Associated documents
timestamp: string           # Cache creation time
metadata_json: string       # Model, source count, etc.

# question_library table
question: string            # Unique question text (normalized)
doc_ids: list<string>       # Related documents
category: string            # Question category
usage_count: int64          # Popularity metric
is_default: bool            # Pre-seeded question
created_at: string          # Creation timestamp
metadata_json: string       # Additional metadata
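
As an illustration, the qa_cache table above could be created with LanceDB's Python API roughly as follows. Field names mirror the schema; the actual table-creation code in lancedb_cache.py may differ.

# Minimal sketch: create the qa_cache table with an explicit PyArrow schema and
# a full-text index on the question field. Names mirror the schema above; the
# real lancedb_cache.py code may differ.
import lancedb
import pyarrow as pa

db = lancedb.connect("./lancedb")

qa_schema = pa.schema([
    ("cache_key", pa.string()),          # sha256(question + doc_ids)
    ("question", pa.string()),
    ("response", pa.string()),
    ("thinking", pa.string()),
    ("doc_ids", pa.list_(pa.string())),
    ("timestamp", pa.string()),          # ISO 8601
    ("metadata_json", pa.string()),
])

if "qa_cache" not in db.table_names():
    tbl = db.create_table("qa_cache", schema=qa_schema)
else:
    tbl = db.open_table("qa_cache")

tbl.create_fts_index("question", replace=True)   # full-text index on the question field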

Performance Optimizations:

  1. Full-Text Search (FTS) Indexes:

    • doc_sections: content, title, document_name
    • qa_cache: question
    • question_library: question
  2. In-Process DataFrame Cache (3-second TTL):

    • Caches table contents as pandas DataFrames in memory
    • Sub-millisecond reads for frequent queries
    • Thread-safe with locks
    • Automatic invalidation on writes
    • Warmed on startup for instant first access
  3. Write Strategy:

    • Immediate writes to LanceDB (ACID-compliant)
    • Cache invalidation triggered after successful write
    • No blocking - operations complete quickly

Data Flow:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Application Request (Read)                               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 β”‚
                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Check In-Process Cache (3s TTL)                            β”‚
β”‚ β€’ Thread-safe lock acquisition                             β”‚
β”‚ β€’ Check timestamp validity                                 β”‚
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
     β”‚ Hit                                       β”‚ Miss
     β–Ό                                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Return DataFrameβ”‚                   β”‚ Query LanceDB Table  β”‚
β”‚ (sub-ms)        β”‚                   β”‚ β€’ Convert to pandas  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   β”‚ β€’ Store in cache     β”‚
                                      β”‚ β€’ Return DataFrame   β”‚
                                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Application Request (Write)                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 β”‚
                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Write to LanceDB                                           β”‚
β”‚ β€’ ACID transaction                                         β”‚
β”‚ β€’ Immediate persistence                                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 β”‚
                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Invalidate In-Process Cache                                β”‚
β”‚ β€’ Remove cached DataFrame                                  β”‚
β”‚ β€’ Next read will refresh from disk                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
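
The read and write paths above amount to a read-through cache with a time-based TTL plus write invalidation. A minimal sketch of that pattern follows; the names are illustrative, and the real logic lives in lancedb_cache.py.

# Illustrative read-through cache with a 3-second TTL and write invalidation.
import threading
import time

import lancedb

db = lancedb.connect("./lancedb")
_cache = {}                # table name -> (monotonic timestamp, pandas DataFrame)
_lock = threading.Lock()
TTL_SECONDS = 3.0

def read_table(name: str):
    """Return the table as a DataFrame, served from memory if fresher than the TTL."""
    with _lock:
        entry = _cache.get(name)
        if entry and time.monotonic() - entry[0] < TTL_SECONDS:
            return entry[1]                        # cache hit: sub-millisecond
    df = db.open_table(name).to_pandas()           # cache miss: read from LanceDB on disk
    with _lock:
        _cache[name] = (time.monotonic(), df)
    return df

def write_rows(name: str, rows: list[dict]) -> None:
    """Write straight to LanceDB, then invalidate the cached DataFrame."""
    db.open_table(name).add(rows)                  # persisted immediately
    with _lock:
        _cache.pop(name, None)                     # next read refreshes from disk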

Migration from Redis:

  • Optional one-time migration utility: lancedb_cache.migrate_from_redis(redis_client)
  • Imports documents, Q&A cache, and question library
  • Preserves all metadata and relationships
  • No data loss during transition

Performance Optimizations

Multi-Layer Caching Strategy

1. KV-Cache (Context State)

  • No document reprocessing: Once cached, documents aren't re-tokenized
  • Multi-turn speedup: 10-40x faster for subsequent queries (from CAG paper)
  • Memory efficient: Tracks token counts and cache size
  • Automatic deduplication: Same documents aren't cached twice
  • Persistent storage: Caches stored on disk for reuse across sessions

2. Q&A Cache (Response Level)

  • Instant retrieval: Identical questions return cached answers immediately
  • Document-aware: Cache keys include document IDs for precise matching
  • Persistent storage: No expiration, manually managed via UI or API
  • Thinking included: Caches both reasoning and final response
  • Per-document management: Clear cache for specific documents

3. Document Section Cache (LanceDB)

  • Parse once: Parsed sections persisted to LanceDB with FTS indexes
  • Fast reload: Load document structure without re-parsing (in-process cache)
  • Hierarchical storage: Maintains parent-child relationships via order_idx
  • Search index: Pre-computed keywords and entities with full-text search
  • Deduplication guards: Prevents repeated section additions
  • In-process cache: 3-second TTL DataFrame cache for frequent reads

Parallel Processing & Concurrent Requests

Concurrent Request Handling

  • 4 parallel LLM workers handle simultaneous requests
  • Non-blocking chat responses during document processing
  • Multiple users can interact concurrently
  • Configurable via Config.OLLAMA_NUM_PARALLEL
  • 5-minute request timeout prevents hanging operations
  • See CONCURRENT_REQUESTS.md for detailed configuration

Section Analysis (4 workers)

  • Concurrent LLM calls for importance scoring
  • Classification of section types (COVENANT, DEFAULT, etc.)
  • Batch processing of subsections
  • Progress logging every 10 sections

Word-Based Page Estimation

  • ~250 words per page heuristic
  • Instant calculation vs. slow LLM page range calls
  • Accurate enough for UI display and citations

Memory Management

In-Memory Section Store

  • Fast lookups by section ID
  • Hierarchical traversal for subsections
  • Automatic memory clearing before fresh loads
  • Prevents duplicate section accumulation

Best Practices

For Credit Agreement Analysis

  1. Upload Full Agreement: Include all sections, schedules, and amendments
  2. Let Parsing Complete: Wait for parallel LLM analysis to finish (progress shown)
  3. Use Agentic Search: For complex queries, agentic search provides reasoning
  4. Check Referenced Sections: Always expand cited sections to verify context
  5. Review Cache: Use Q&A cache management to track analysis history

For Optimal Performance

  1. Preserve LanceDB Data: Persistence is built in via the embedded ./lancedb directory; keep it intact so caches survive restarts
  2. Batch Upload: Upload all related documents before starting Q&A
  3. Use Suggested Questions: Build question library for faster team collaboration
  4. Monitor Cache Stats: Clear old caches periodically to free memory
  5. Parallel Processing: Parser uses 4 workers by default; increase for faster analysis

For Question Library

  1. Categorize Thoughtfully: Questions are auto-categorized but review for accuracy
  2. Track Usage: Popular questions surface to the top automatically
  3. Search Before Asking: Use autocomplete to find existing answers
  4. Document-Specific: Filter questions by document for focused analysis
  5. Clear Periodically: Remove outdated questions to keep library relevant

For Multi-Document Context

  1. Related Documents: Upload contracts and amendments together
  2. Clear Context Cache: When switching document sets, clear cache
  3. Check Message Source IDs: Verify which documents are in context
  4. LanceDB Loading: For frequently used documents, load previously parsed sections from LanceDB instead of re-parsing

Limitations

Context Window Constraints

  • Qwen3-14B: ~8K tokens (~3-4 medium PDFs or 1 large credit agreement)
  • Token Estimation: ~750 tokens per page for dense legal documents
  • Workaround: Focus on specific sections or use search to find relevant parts

Memory Requirements

  • Minimum: 8GB RAM for 7B models
  • Recommended: 16GB RAM for 14B models
  • With Redis: Additional ~100MB-1GB depending on document count
  • Section Analysis: Uses 4 parallel workers (can adjust in code)

Local Storage

  β€’ Embedded Only: LanceDB runs in-process; there is no shared or remote store for teams
  β€’ Single Directory: The Q&A cache, question library, and parsed sections all live in ./lancedb
  β€’ Data Loss Risk: Deleting or moving ./lancedb clears all cached data
  β€’ In-Memory Fallback: Document sections can be kept memory-only, but they won't persist across sessions

Not Ideal For

  • Constantly Updating Knowledge: Traditional RAG better for dynamic data
  • Very Large Corpora: 100+ documents may exceed context limits
  • Real-Time Collaboration: Single-user app, not designed for teams
  • Production Deployments: This is a research/analysis tool, not a production service

Recent Changes (December 2025)

Enhanced PDF Intelligence

  • βœ… Parallel LLM Section Analysis: 4 concurrent workers for faster parsing
  • βœ… Credit Analyst Classification: Automatic detection of COVENANTS, DEFAULTS, etc.
  • βœ… Importance Scoring: AI-driven relevance analysis (0-1 scale)
  • βœ… Page-Accurate Tracking: Word-based estimation for instant page mapping
  • βœ… Hierarchical Sections: Full parent-child relationships preserved

Search & Discovery

  • βœ… Multi-Modal Search: Keyword, semantic, and agentic (Claude-powered)
  • βœ… Named Entity Recognition: Extract PARTY, DATE, MONEY, AGREEMENT entities
  • βœ… Entity Filtering: Browse by entity type across all sections
  • βœ… Section References: Auto-expand cited sections in chat responses

Caching System

  • βœ… Q&A Cache: LanceDB-backed with persistent storage
  • βœ… Question Library: 15+ categories with autocomplete
  • βœ… Suggested Questions: Popular queries by document or global
  • βœ… Cache Analytics: Real-time stats and management UI
  • βœ… Deduplication Guards: Prevent repeated section additions

UI/UX Improvements

  • βœ… Document Tabs: Sections, Search, Entities in organized tabs
  • βœ… Cache Indicators: Visual feedback for cache hits
  • βœ… Referenced Section Expanders: Click to view full cited sections
  • βœ… Browse by Category: Explore questions by type
  • βœ… LanceDB Document Picker: Load previously parsed documents

Performance

  • βœ… Concurrent Request Handling: 4 parallel LLM workers for simultaneous requests
  • βœ… Memory Management: Automatic clearing before fresh loads
  • βœ… Parallel Processing: ThreadPoolExecutor for section analysis
  • βœ… LanceDB Persistence: Store parsed sections with FTS indexes for instant reload
  • βœ… Word-Based Estimation: Fast page calculation without LLM calls
  • βœ… Connection Pooling: Optimized Ollama connections with timeout management

Technical

  • βœ… Python 3.14 Support: Compatible with latest Python
  • βœ… Embedded Storage: No external database server required
  • βœ… Enhanced Error Handling: Better logging and fallbacks
  • βœ… Document Deduplication: Prevent duplicate button keys

Citation

If you use this project or the CAG methodology, please cite the original paper:

@inproceedings{chan2025cag,
  title={Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks},
  author={Chan, Brian J and Chen, Chao-Ting and Cheng, Jui-Hung and Huang, Hen-Hsen},
  booktitle={Proceedings of the ACM Web Conference 2025},
  year={2025}
}

License

MIT License - See LICENSE file for details

Contributing

Contributions welcome! Please open an issue or submit a pull request.

Author

Created by Amitabha Karmakar

Support

Getting Help

For Issues or Questions:

  1. Check the Troubleshooting section above
  2. Review the Best Practices for optimal usage
  3. Check logs in the terminal where you ran streamlit run app.py
  4. Open a GitHub issue with:
    • Error message and full traceback
    • Python version (python --version)
    • Ollama status (ollama list)
    • LanceDB tables (python -c "import lancedb; print(lancedb.connect('./lancedb').list_tables().tables)")
    • Steps to reproduce

Documentation:

  • CAG Paper: https://arxiv.org/abs/2412.15605v1
  • Implementation Details:
    • documentation/AGENTIC_RAG_GUIDE.md - NEW! Multi-step reasoning RAG system
    • documentation/AGENT_SDK_INTEGRATION.md - NEW! Claude Agent SDK MCP tools
    • documentation/MCP_TOOLS_GUIDE.md - MCP tools user guide
    • documentation/QA_CACHE_IMPLEMENTATION.md - Q&A caching system
    • documentation/QUESTION_LIBRARY_IMPLEMENTATION.md - Question library design
    • documentation/PDF_PARSER_SKILL_SUMMARY.md - Enhanced PDF parsing
    • documentation/CLAUDE_SKILLS_GUIDE.md - Claude skills integration
    • skills/pdf_parser/ENHANCED_PARSER_GUIDE.md - Advanced document parsing

Logs & Debugging:

# Check terminal output for detailed logs
# Logs include:
# - Section extraction progress
# - LLM analysis status
# - Cache hits/misses
# - LanceDB storage status
# - Entity extraction results

# Enable more verbose logging (if needed):
export LOG_LEVEL=DEBUG
streamlit run app.py

Built with ❀️ using Qwen3, Ollama, LangChain, Docling, and Streamlit
