A Cache-Augmented Generation (CAG) application for private, local document chat using large language models with intelligent document parsing, LanceDB-backed persistent storage, and credit agreement analysis capabilities.
```bash
# 1. Install prerequisites
brew install ollama
brew services start ollama

# 2. Clone and set up
git clone https://github.com/letslego/cagvault.git
cd cagvault
python3.12 -m venv .venv312
source .venv312/bin/activate
pip install -e .

# 3. Download an LLM model (choose one based on your RAM)
# Default (16GB RAM): Qwen3-14B
ollama pull hf.co/unsloth/Qwen3-14B-GGUF:Q4_K_XL
# Or for best quality (64GB+ RAM): DeepSeek V3
# ollama pull deepseek-ai/DeepSeek-V3
# Or lightweight (8GB RAM): Llama 3.1 8B
# ollama pull llama3.1:8b

# 4. Start the app
streamlit run app.py
# Open http://localhost:8501 in your browser

# 5. Upload a PDF and start chatting!
```

First-Time Tips:
- Upload a credit agreement PDF to see section analysis in action
- Try "π‘ Suggested Questions" after parsing completes
- Explore the "Sections" tab to see hierarchical structure
- Use "Agentic Search" for intelligent query understanding
Agentic RAG System:
- Multi-Step Reasoning: The agent understands intent, selects a strategy, and validates answers
- 5 Retrieval Strategies: Semantic, keyword, hybrid, agentic, and entity-based (auto-selected)
- Self-Reflection: Optional answer validation with confidence scoring
- Full Transparency: Complete reasoning traces showing the agent's thought process
- Smart Strategy Selection: Automatically chooses the best approach based on query type
- Claude Agent SDK Integration: 6 specialized MCP tools built with the Agent SDK:
  - Web Search: Fetch current data from external sources (@tool decorator)
  - Entity Extraction: Extract dates, amounts, names, organizations (NER-based)
  - Section Ranking: Prioritize important sections using credit analyst criteria
  - Cross-Document Relationships: Find references, amendments, guarantees
  - Fact Verification: Validate claims against web sources
  - Follow-Up Suggestions: Intelligent next-question recommendations
Storage Architecture Upgrade:
- LanceDB Embedded Database: Replaced Redis with LanceDB for all persistent storage
- In-Process Caching: 3-second TTL DataFrame cache for sub-millisecond reads
- Full-Text Search: Built-in FTS indexes on content, titles, and questions
- Zero External Dependencies: No separate database server required; all data lives in `./lancedb`
- Redis Migration Tool: One-time utility to import existing Redis data
- ACID Compliance: Reliable transactions with automatic cache invalidation
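Because LanceDB is embedded, the tables above can be opened directly from Python for ad-hoc inspection outside the app. A minimal sketch, assuming the default `./lancedb` directory and the table names described in this README:

```python
# Inspect the embedded LanceDB store directly (illustrative; adjust names to your setup).
import lancedb

db = lancedb.connect("./lancedb")   # opens the database in-process, no server involved
print(db.table_names())             # e.g. ['doc_sections', 'qa_cache', 'question_library']

sections = db.open_table("doc_sections")
df = sections.to_pandas()           # pull the table into a pandas DataFrame
print(df[["document_name", "title", "level"]].head())
```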
Enhanced PDF Intelligence:
- LLM-Powered Section Analysis: Parallel processing with credit analyst classification and importance scoring
- Smart Section Extraction: Hierarchical document structure with page-accurate tracking
- Multi-Modal Search: Keyword, semantic, and agentic (Claude-powered) search within documents
- Named Entity Recognition: Extract and index parties, dates, amounts, and legal terms
- Referenced Section Display: Automatically expand cited sections in chat responses
Intelligent Caching System:
- Q&A Cache: LanceDB-backed answer caching per document with persistent storage
- Question Library: Track popular questions by category with autocomplete suggestions
- KV-Cache Optimization: 10-40x faster multi-turn conversations
- Cache Analytics: Real-time statistics and per-document cache management
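The Q&A cache is keyed on the question plus the documents in context, so the same question asked against a different document set is treated as a miss. A rough sketch of how such a key can be derived (the exact normalization and concatenation used in `qa_cache.py` may differ):

```python
import hashlib

def qa_cache_key(question: str, doc_ids: list[str]) -> str:
    """Illustrative cache key: SHA-256 over the normalized question plus sorted doc IDs."""
    normalized = question.strip().lower()
    payload = normalized + "|" + ",".join(sorted(doc_ids))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key = qa_cache_key("What is the interest rate margin?", ["credit_agreement_2024"])
print(key)  # stable key -> an identical question over the same documents hits the cache
```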
Credit Agreement Features:
- Document Classification: Automatic detection of covenants, defaults, and key provisions
- Section Importance Scoring: AI-driven relevance analysis for credit analysts
- Cross-Reference Detection: Track dependencies between sections
- Page-Accurate Citations: Precise page ranges for every section
Based on the paper "Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks" (WWW '25), CAG is an alternative paradigm to traditional Retrieval-Augmented Generation (RAG) that leverages the extended context capabilities of modern LLMs.
```text
TRADITIONAL RAG WORKFLOW

  User Query
      |
      v
  Retriever (BM25/Dense) <---> Search Index (large DB)    <- retrieval latency
      |
      | retrieved documents
      v
  Generator (LLM)    <- risk of: missing relevant docs,
      |                          ranking errors, search failures
      v
  Answer

CACHE-AUGMENTED GENERATION (CAG) WORKFLOW

  SETUP PHASE (one-time)
    All documents
        |
        v
    LLM Processor (batch process)  -> populates the LanceDB cache
        |                             (sections + Q&A store)
        v
    Cached LanceDB store (ready to use, embedded on disk)

  INFERENCE PHASE (fast)
    User query + LanceDB cache
        |
        v
    LLM + LanceDB hits (context + cache)
      - local retrieval
      - low latency
      - guaranteed context
        |
        v
    Answer (instant)

  MULTI-TURN OPTIMIZATION
    For the next query, simply truncate and reuse the cached knowledge
    (no need to reprocess documents)
```
1. Preload Phase (One-time setup)
- All relevant documents are loaded into the LLM's extended context window
- The model processes the entire knowledge base at once
2. Cache Phase (Offline computation)
- The model's key-value (KV) cache is precomputed and stored
- This cache encapsulates the inference state of the LLM with all knowledge
- No additional computation needed for each query
3. Inference Phase (Fast queries)
- User queries are appended to the preloaded context
- The model uses the cached parameters to generate responses directly
- No retrieval step needed → instant answers
4. Reset Phase (Multi-turn optimization)
- For new queries, the cache is efficiently truncated and reused
- The preloaded knowledge remains available without reprocessing
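Concretely, the four phases map to a very small loop. The sketch below is illustrative only, using LangChain's ChatOllama with placeholder file names; the app's real chat loop lives in chatbot.py and kvcache.py:

```python
# Conceptual sketch of the CAG phases with a local Ollama model.
from pathlib import Path
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1:8b", temperature=0.0)

# 1. Preload: concatenate the whole (small) knowledge base once.
documents = [Path(p).read_text(encoding="utf-8") for p in ["doc_a.txt", "doc_b.txt"]]
knowledge = "\n\n".join(documents)
system = f"Answer strictly from the documents below.\n\n{knowledge}"

# 2./3. Cache + inference: every query reuses the same preloaded prefix,
#       so the server can reuse its KV cache instead of reprocessing documents.
preloaded = [("system", system)]
for question in ["Who are the parties?", "What is the maturity date?"]:
    answer = llm.invoke(preloaded + [("human", question)])
    print(question, "->", answer.content[:80])
    # 4. Reset: keep only the preloaded knowledge; drop per-turn chatter as needed.
```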
- Zero Retrieval Latency: No real-time document search
- Unified Context: Holistic understanding of all documents
- Simplified Architecture: Single model, no retriever integration
- Eliminates Retrieval Errors: All relevant information is guaranteed to be available
- Perfect for Constrained Knowledge Bases: Ideal when all documents fit in the context window
CagVault now runs as a local agentic stack that combines Streamlit UI, Claude Agent SDK tools, and LanceDB-backed storage.
```text
Browser
  Streamlit UI (app.py)
    - Chat with reasoning trace and skill tags
    - Upload/parse PDFs and manage caches
    - Question library + sections/entities explorer
        |  questions, uploads, actions
        v
Agent Brain
  Router:   question classifier + skill inference
  Planner:  chooses cached answer, retrieval, or tool use
  Reasoner: Claude/Ollama models with reflection
  Tools (Claude Agent SDK via MCP):
    web_search, entity_extractor, section_ranker,
    cross_doc_links, fact_verifier, followup_suggester
  Skills: PDF parser, TOC/NER search, credit analyst prompts,
          knowledge-base skill registry
  Caches: Q&A cache (LanceDB), question library (LanceDB),
          in-memory DataFrame cache
        |  retrieval + storage calls
        v
Storage and Engines
  LanceDB (embedded): doc_sections, qa_cache, question_library
  Search:   full-text, semantic, agentic rerank, entity filters
  Runtimes: Ollama models, CAG MCP server hosting the tools
```
Key Flows:
- Upload/Parse → LanceDB: PDFs run through Docling + LLM section analysis and are saved to `doc_sections` with entities and TOC metadata.
- Ask → Router → Cache (default mode): Questions first check the LanceDB Q&A cache and question library before invoking the LLM.
- Retrieval/Tools: When needed, the agent retrieves sections from LanceDB or calls MCP tools (web, entity, ranking, cross-doc, verification, follow-ups).
- Answering: Responses stream with reasoning trace, cited sections, and the skills/tools used for transparency.
- Persistence: All storage is local (LanceDB + optional caches); no cloud services are required.
Execution Modes:
- Default (LanceDB Chat): Uses LanceDB retrieval plus Q&A cache and question library for fast local answers. No MCP tools or multi-step agent planning are invoked.
- Agentic RAG Mode (toggle in UI): Adds planning, strategy selection, and MCP tools (web search, entities, ranking, cross-doc, fact check, follow-ups). This path currently bypasses the LanceDB Q&A cache for answers.
Knowledge Base Skills:
- Skills live locally in `knowledge-base/` and are inferred by lightweight keyword heuristics. They are rendered with each answer for transparency and kept private on disk (see `.gitignore`).
- Fully Local & Private: No API keys, cloud services, or internet connection required for core features
- Document Control: All processing happens on your machine
- Embedded Storage: All persistence is handled by LanceDB in-process; Redis is only needed if you migrate legacy data
- Enhanced PDF Parsing: Using Docling with LLM-powered section analysis
- Multi-Format Support: PDF, TXT, MD files and web URLs
- Hierarchical Structure: Automatic detection of sections, subsections, and tables
- Named Entity Recognition: Extract parties, dates, monetary amounts, and legal terms
- Page-Accurate Tracking: Precise page ranges for every section
- Keyword Search: Fast full-text search across all sections
- Semantic Search: AI-powered similarity matching
- Agentic Search: Claude-driven intelligent query understanding with reasoning
- Entity Filtering: Search by PARTY, DATE, MONEY, AGREEMENT, or PERCENTAGE
- Q&A Cache: LanceDB-backed answer caching with automatic deduplication
- Question Library: Track popular questions organized by 15+ categories
- KV-Cache Optimization: 10-40x faster multi-turn conversations
- Cache Analytics: Real-time statistics and granular cache management
- Document-Specific Caching: Per-document cache with TTL management
- Streaming Responses: Real-time generation with thinking process visibility
- Referenced Sections: Auto-expand cited sections in answers
- Suggested Questions: Category-based question recommendations
- Autocomplete Search: Type-ahead suggestions from question library
- Multi-Document Context: Chat across multiple documents simultaneously
- Speech-to-Text (STT): Record questions via microphone and transcribe them with Whisper (local faster-whisper or the OpenAI Whisper API)
- Text-to-Speech (TTS): Synthesize answers to audio using pyttsx3 (local synthesis)
- Voice Input: Ask questions hands-free, ideal for multitasking
- Voice Output: Listen to answers while reviewing documents
- Configurable Settings: Adjust recording duration, speech rate, and volume
- Privacy: TTS is always local; STT can run fully offline with local Whisper, or use the OpenAI API if configured
- Section Classification: Automatic identification of COVENANTS, DEFAULTS, DEFINITIONS, etc.
- Importance Scoring: AI-driven relevance analysis for credit analysts
- Cross-Reference Tracking: Detect dependencies between sections
- Covenant Analysis: Specialized understanding of debt agreements and financial covenants
- Large Context Windows: Leverages Qwen3-14B's 8K+ token capacity
- Concurrent Request Handling: 4 parallel LLM workers for simultaneous requests
- Parallel Processing: Concurrent LLM calls for faster document analysis (4 workers)
- Smart Page Estimation: Word-based calculation for instant section mapping
- Memory Management: In-memory section store with LanceDB persistence
- Connection Pooling: Optimized Ollama connections with timeout management
The system now includes OpenLineage-compliant data lineage tracking to monitor your document processing pipeline:
Click the "Lineage" button in the top-right corner of the app, or navigate to:
http://localhost:8501/lineage_dashboard
Monitor end-to-end data flow from document ingestion → embedding → retrieval → LLM response:
Dashboard Views:
- Overview (default): Total events, assets, success rate, and operation breakdown with visualizations
- Events Timeline: Chronological event log with filtering by operation type
- Asset Lineage: Trace complete data flow for any document or section
- Performance Analysis: Duration distribution, trends over time, and slowest operations
The system automatically tracks:
- ingest: Document ingestion (PDF files with metadata)
- extract_section: Section extraction from documents
- embed: Embedding generation from sections (1024 dimensions)
- store_lancedb: Storage in the LanceDB vector database
- retrieve: Retrieval from LanceDB (with cache hit tracking)
- llm_response: LLM-generated answers with model info
- Operation duration (milliseconds)
- Total events and unique assets
- Success rate and status breakdown
- Operation counts and average durations
- Complete data flow lineage for any asset
- Upload a Document: When you upload a PDF, all processing steps are automatically tracked
- Ask Questions: Retrievals and LLM responses are tracked in real-time
- View Dashboard: Click the "Lineage" button to see live metrics and visualizations
- Analyze Performance: Check the Performance Analysis view for bottlenecks
- Storage: SQLite database at `.cache/lineage.db` (local, private)
- Overhead: ~5-10ms per tracked operation (minimal impact)
- Standard: OpenLineage-compliant metadata format
- Dependencies: None (uses built-in SQLite + Plotly)
- No External Services: All data stored locally
See documentation/DATA_LINEAGE_GUIDE.md for detailed API documentation and documentation/LINEAGE_IMPLEMENTATION.md for implementation details.
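Since the lineage store is a plain SQLite file, it can also be inspected outside the dashboard. A small sketch that only lists whatever tables the tracker created, without assuming a particular schema (the schema itself is documented in documentation/DATA_LINEAGE_GUIDE.md):

```python
import sqlite3

# Open the local lineage database written by the tracker.
con = sqlite3.connect(".cache/lineage.db")
tables = con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
print("lineage tables:", [name for (name,) in tables])
con.close()
```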
- macOS (or Linux/Windows with appropriate package managers)
- Python 3.12.x or 3.14.x
- Homebrew (for macOS)
- At least 10GB free disk space (for the LLM model)
- 16GB RAM recommended (8GB minimum for 7B models)
- LanceDB (included with dependencies)
- Embedded vector database for persistent storage
- No separate installation or server required
- Automatically stores: Q&A cache, question library, parsed document sections
- Database location: `./lancedb` directory
- Migration tool available for existing Redis users (see below)
```bash
git clone https://github.com/letslego/cagvault.git
cd cagvault
```

Create a Python 3.12 virtual environment:

```bash
python3.12 -m venv .venv312
source .venv312/bin/activate
```

Install all required Python packages:

```bash
pip install -e .
```

This will install:
- `streamlit` - Web UI framework
- `langchain-core`, `langchain-ollama`, `langchain-groq`, `langchain-community` - LLM orchestration
- `docling` - Document conversion library
- Other dependencies (see `pyproject.toml`)
Ollama is a local LLM inference server that runs models entirely on your machine.
```bash
brew install ollama
brew services start ollama
```

Verify Ollama is running:

```bash
ollama list
```

On Linux:

```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
```

On Windows: download and run the installer from ollama.com/download
CagVault supports multiple high-performance models optimized for RAG and document understanding. Choose based on your hardware and performance needs.
Download 3 essential models covering all use cases (~30GB):
```bash
./download_essential_models.sh
```

This installs:
- Qwen3-14B (Default) - Best balance of quality and speed
- Llama 3.1 8B (Lightweight) - Fast responses, low memory
- Phi-4 (Efficient) - Microsoft's optimized model

To download all 10 supported models (~200GB):

```bash
./download_models.sh
```

Alternatively, download individual models:

DeepSeek V3 (Recommended for Best Quality) - 685B parameters, state-of-the-art reasoning. Requires 64GB+ RAM, Apple Silicon M3 Max or similar.

```bash
ollama pull deepseek-ai/DeepSeek-V3
```

Qwen3-14B (Default) - Excellent balance of quality and speed. Requires 16GB+ RAM.

```bash
ollama pull hf.co/unsloth/Qwen3-14B-GGUF:Q4_K_XL
```

DeepSeek R1 - Advanced reasoning for complex credit agreement queries. Requires 32GB+ RAM.

```bash
ollama pull deepseek-ai/DeepSeek-R1
```

Mistral Large - Excellent long-context performance. Requires 32GB+ RAM.

```bash
ollama pull mistral-large-latest
```

Command R+ - Cohere's RAG-optimized model. Requires 32GB+ RAM.

```bash
ollama pull command-r-plus:latest
```

Phi-4 - Microsoft's efficient 14B model:

```bash
ollama pull phi4:latest
```

Llama 3.1 8B - Fast and lightweight:

```bash
ollama pull llama3.1:8b
```

Mistral Small - Quick responses for simpler queries:

```bash
ollama pull mistral-small-latest
```

Llama 3.3 70B - Strong reasoning:

```bash
ollama pull llama3.3:70b
```

Gemma 2 27B - Google's reasoning model:

```bash
ollama pull gemma2:27b
```

Option 1: Use the UI (Recommended)
- Start the app: `streamlit run app.py`
- Open the sidebar
- Expand "Model Settings"
- Select your preferred model from the dropdown
- Click "Restart App" to apply
The UI shows RAM requirements and speed for each model to help you choose.
Option 2: Edit Config File
Edit config.py directly:

```python
class Config:
    MODEL = DEEPSEEK_V3           # Change from QWEN_3 to any model above
    OLLAMA_CONTEXT_WINDOW = 8192
```

Available model constants:
- `QWEN_3` (default) - Qwen3-14B
- `DEEPSEEK_V3` - DeepSeek V3 (best quality)
- `DEEPSEEK_R1` - DeepSeek R1 (advanced reasoning)
- `MISTRAL_LARGE` - Mistral Large
- `MISTRAL_SMALL` - Mistral Small
- `LLAMA_3_3_70B` - Llama 3.3 70B
- `LLAMA_3_1_8B` - Llama 3.1 8B
- `PHI_4` - Phi-4
- `GEMMA_2_27B` - Gemma 2 27B
- `COMMAND_R_PLUS` - Command R+
```bash
# List installed models
ollama list

# Search for models on the Ollama library
ollama search deepseek
ollama search mistral
ollama search llama3

# Pull any model
ollama pull <model-name>
```

If you have existing data in Redis, you can migrate it to LanceDB:
```python
# In a Python console or script
from lancedb_cache import get_lancedb_store
import redis

# Connect to your Redis instance
redis_client = redis.from_url("redis://localhost:6379/0")

# Migrate all data (documents, Q&A cache, question library)
store = get_lancedb_store()
store.migrate_from_redis(redis_client)
print("Migration complete! Redis data imported to LanceDB.")
```

Note: After migration, you can optionally remove Redis. LanceDB is now the default persistent storage and requires no separate server.
Voice features allow speech-to-text input and text-to-speech output.
Use local Whisper models - no API keys needed, 100% offline:
```bash
# Fast, optimized local Whisper (recommended)
pip install pyttsx3 sounddevice soundfile faster-whisper

# OR standard local Whisper
pip install pyttsx3 sounddevice soundfile openai-whisper
```

Advantages:
- Completely free and open source
- Works offline (no internet required)
- No API keys or usage limits
- Privacy-focused (data stays local)
Model sizes (faster-whisper or openai-whisper):
- `tiny` - Fastest, least accurate (~75MB)
- `base` - Good balance (default, ~145MB)
- `small` - Better accuracy (~466MB)
- `medium` - High accuracy (~1.5GB)
- `large` - Best accuracy (~3GB)
If you prefer cloud-based STT with API:
```bash
pip install pyttsx3 sounddevice soundfile openai

# Set API key
export OPENAI_API_KEY="your-api-key-here"
```

Note: Both options use pyttsx3 for TTS (already open source and local).
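For reference, this is roughly what the two halves of the voice pipeline look like with the libraries above. It is a standalone sketch, not the app's voice_features.py implementation, and the file name question.wav is just an example:

```python
# Transcribe a recorded question locally, then speak an answer -- no cloud calls.
from faster_whisper import WhisperModel
import pyttsx3

# Speech-to-text with a local Whisper model ("base" balances speed and accuracy).
model = WhisperModel("base", compute_type="int8")
segments, _info = model.transcribe("question.wav")
question = " ".join(segment.text for segment in segments).strip()
print("Transcribed question:", question)

# Text-to-speech with pyttsx3 (fully offline).
engine = pyttsx3.init()
engine.setProperty("rate", 180)    # words per minute
engine.setProperty("volume", 0.9)  # 0.0 - 1.0
engine.say("Here is the answer to your question.")
engine.runAndWait()
```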
Check that everything is installed correctly:
```bash
# Python environment
python --version   # Should show 3.12.x or 3.14.x

# Ollama service
ollama list        # Should show your downloaded models

# Python packages
pip list | grep -E "(streamlit|langchain|docling|lancedb)"

# Optional voice features (if installed)
pip list | grep -E "(pyttsx3|sounddevice|openai)"
```

With your virtual environment activated:

```bash
streamlit run app.py
```

The application will open in your browser at http://localhost:8504
Via File Upload:
- Click the file uploader in the sidebar
- Select PDF, TXT, or MD files
- Watch the enhanced parsing process with section extraction
- View parsing statistics: pages, sections, entities found
Via URL:
- Paste a web URL in the text input
- Click "Add Web Page" to scrape and convert to text
From LanceDB:
- Click "ποΈ Documents in LanceDB" expander
- Select a previously parsed document
- Click "Load for chat" to restore from persistent storage
Sections Tab:
- Browse hierarchical document structure
- View page ranges, word counts, and table indicators
- Click to expand section content
- See coverage statistics (page distribution)
Search Tab:
- Agentic Search: Claude-powered intelligent search with reasoning
- Keyword Search: Fast full-text search with match counts
- Semantic Search: AI similarity matching with relevance scores
Entities Tab:
- Filter by type: MONEY, DATE, PARTY, AGREEMENT, PERCENTAGE
- Click entities to see source sections
- Track key document information
Direct Input:
- Type your question in the chat input at the bottom
- Press Enter to submit
Suggested Questions:
- Click "π‘ Suggested Questions" to see popular queries
- Click any suggestion to instantly ask it
- Questions are categorized: Definitions, Parties, Financial, etc.
Browse by Category:
- Click "π Browse by Category"
- Explore questions organized by 15+ categories
- View document-specific or global questions
Chat Messages:
- Thinking Process: Expand "CAG's thoughts" to see reasoning
- Streaming Answers: Watch responses generate in real-time
- Cache Indicator: "Using cached response" shows when answers are cached
Referenced Sections:
- Automatically expands sections cited in the answer
- Click section expanders to view full content
- Includes page ranges and section metadata
Cache Status:
- Green "πΎ Using cached response" = instant retrieval from LanceDB
- No indicator = fresh LLM generation + automatic caching to LanceDB
Voice Output (Optional):
- Click "π Speak" button next to assistant responses
- Hear synthesized answers while reviewing documents
- Adjust speech rate and volume in sidebar "π€ Voice Features"
- Perfect for hands-free operation or accessibility
Voice Input:
- Click "ποΈ Record Question" to start recording via microphone
- Audio is recorded locally (privacy-first)
- Click "π Transcribe" to convert speech to text using Whisper
- Transcribed question is automatically submitted for analysis
Voice Output:
- Click "π Speak" below any assistant answer
- Answer text is synthesized to audio using local TTS
- Adjust settings in sidebar (speech rate, volume)
- Audio plays inline with playback controls
Configuration (Sidebar):
- Expand "π€ Voice Features" section
- Toggle "ποΈ Voice Input" to enable recording
- Toggle "π Voice Output" to enable synthesis
- Adjust recording duration (5-60 seconds)
- Adjust speech rate (50-300 words per minute)
- Adjust output volume (0.0-1.0)
Requirements:
- Voice Input: Requires `OPENAI_API_KEY` when using the Whisper API; local Whisper (faster-whisper) needs no key
- Voice Output: Works offline with pyttsx3 (no API needed)
- Both: Require the audio recording libraries (`sounddevice`, `soundfile`)
Cache Stats (Sidebar):
- View total contexts, tokens, and cache hits
- Clear all context cache with "Clear Cache"
Q&A Cache Management:
- View cached Q&A pairs per document
- Browse questions with thinking and responses
- Clear per-document cache or all Q&A cache
- Persistent storage in LanceDB (no memory limits)
Question Library:
- Search library with autocomplete
- View usage counts and categories
- Delete individual questions
- Clear entire library
```text
cagvault/
├── app.py                    # Streamlit UI with enhanced features
├── config.py                 # Model configuration and settings
├── models.py                 # LLM factory (Ollama/Groq)
├── knowledge.py              # Document loading and conversion
├── chatbot.py                # Chat logic with streaming and prompts
├── kvcache.py                # KV-cache manager for context caching
├── lancedb_cache.py          # LanceDB storage layer with in-process cache
├── qa_cache.py               # LanceDB-backed Q&A caching system
├── question_library.py       # Question library with categorization
├── voice_features.py         # Speech-to-text and text-to-speech (optional)
├── simple_cag.py             # Simplified CAG implementation
├── pyproject.toml            # Python dependencies
├── lancedb/                  # LanceDB embedded database directory
│   ├── doc_sections.lance/       # Document sections table
│   ├── qa_cache.lance/           # Q&A cache table
│   └── question_library.lance/   # Question library table
├── skills/
│   └── pdf_parser/
│       ├── pdf_parser.py             # Core PDF parsing (Docling)
│       ├── enhanced_parser.py        # LLM-powered section analysis
│       ├── ner_search.py             # NER and search engines
│       ├── credit_analyst_prompt.py  # Credit analyst classification
│       └── llm_section_evaluator.py  # Section importance scoring
├── .cache/
│   ├── documents/            # Parsed document cache
│   ├── kvcache/              # KV-cache storage
│   └── toc_sections/         # TOC-based section extraction
└── README.md                 # This file
```
By default, CagVault uses Qwen3-14B locally via Ollama. To change models, edit config.py:
```python
class Config:
    MODEL = DEEPSEEK_V3           # Change to any model constant
    OLLAMA_CONTEXT_WINDOW = 8192  # Adjust context size
```

Currently, CagVault supports:
- Ollama (default): Local inference, completely private, no API key needed
- Groq (optional): Cloud inference, requires the `GROQ_API_KEY` environment variable
| Model | Size | RAM Required | Context Window | Best For | Speed |
|---|---|---|---|---|---|
| DeepSeek V3 | 685B | 64GB+ | 64K | Best overall quality, complex reasoning | Slow |
| DeepSeek R1 | ~70B | 32GB+ | 32K | Advanced reasoning, credit analysis | Medium |
| Command R+ | ~104B | 32GB+ | 128K | RAG-optimized, long documents | Medium |
| Mistral Large | ~123B | 32GB+ | 128K | Long-context tasks | Medium |
| Llama 3.3 70B | 70B | 32GB+ | 128K | Strong reasoning, instruction following | Medium |
| Gemma 2 27B | 27B | 16GB+ | 8K | Balanced reasoning | Fast |
| Qwen3-14B ⭐ | 14B | 16GB | 8K | Default, excellent balance | Fast |
| Phi-4 | 14B | 16GB | 16K | Efficient, Microsoft-optimized | Fast |
| Llama 3.1 8B | 8B | 8GB | 128K | Lightweight, fast responses | Very Fast |
| Mistral Small | 7B | 8GB | 32K | Simple queries, minimal resources | Very Fast |
⭐ = Default model
To add a new model not in the config:
```python
# In config.py, add your model
MY_CUSTOM_MODEL = ModelConfig(
    "model-name-from-ollama",
    temperature=0.0,
    provider=ModelProvider.OLLAMA,
)

# Then set it as the default
class Config:
    MODEL = MY_CUSTOM_MODEL
```

For a full list of available models, visit ollama.com/library
Error: httpx.ConnectError: [Errno 61] Connection refused
Solution: Start the Ollama service:
```bash
brew services start ollama   # macOS
# or
ollama serve &               # Linux
```

Error: ollama.ResponseError: model 'xyz' not found

Solution: Pull the model first:

```bash
ollama pull hf.co/unsloth/Qwen3-14B-GGUF:Q4_K_XL
```

Or use a different available model:

```bash
ollama pull llama2:latest
```

Error: Pydantic warnings or import errors

Solution: Ensure you're using Python 3.12:

```bash
python --version
# If not 3.12, recreate the virtual environment with Python 3.12
python3.12 -m venv .venv312
source .venv312/bin/activate
pip install -e .
```

If the model runs out of memory during inference:
- Use a smaller model (e.g., `llama2:latest` instead of `qwen3:14b`)
- Reduce `OLLAMA_CONTEXT_WINDOW` in config.py
- Reduce `OLLAMA_NUM_PARALLEL` in config.py (try 2 instead of 4)
- Close other applications
- Increase system swap space
If requests are timing out or hanging:
Check concurrent load:

```bash
# Monitor Ollama connections
lsof -i :11434 | wc -l   # Count active connections
```

Solutions:
- Increase `REQUEST_TIMEOUT` in config.py for complex queries
- Reduce `OLLAMA_NUM_PARALLEL` if the system is overloaded
- Check Ollama logs: `ollama logs` or check the system console
- Restart Ollama: `brew services restart ollama`

Optimal settings by RAM:
- 8-16GB RAM: `OLLAMA_NUM_PARALLEL = 2`
- 16-32GB RAM: `OLLAMA_NUM_PARALLEL = 4` (default)
- 32GB+ RAM: `OLLAMA_NUM_PARALLEL = 6-8`
See CONCURRENT_REQUESTS.md for detailed tuning guide.
Error: Database connection or table access issues
Solution:
```bash
# Check LanceDB directory permissions
ls -la ./lancedb

# If corrupted, remove and restart (will lose cached data)
rm -rf ./lancedb
streamlit run app.py   # Tables will be recreated

# To inspect LanceDB contents
python -c "import lancedb; db = lancedb.connect('./lancedb'); print(db.list_tables().tables)"
```

Note: LanceDB is embedded and requires no separate server. All data is stored locally in the ./lancedb directory.
If cache seems corrupted or causes issues:
```bash
# Clear the KV cache
rm -rf .cache/kvcache/

# Or clear via the UI:
# Click "Clear Cache" in the sidebar
```

If cached answers seem outdated or incorrect:

```bash
# Clear the Q&A cache via the UI:
# 1. Expand "Q&A Cache Management" in the sidebar
# 2. Click "Clear All Q&A Cache"

# Or clear the LanceDB cache programmatically:
python -c "from qa_cache import get_qa_cache; get_qa_cache().clear_all_cache()"

# Or remove the entire QA table:
python -c "import lancedb; db = lancedb.connect('./lancedb'); db.drop_table('qa_cache')"
```

If you see repeated sections in the UI or logs:
Cause: Document loaded multiple times without clearing memory
Solution: This should be automatically prevented by the deduplication guards. If it still occurs:
```bash
# Restart the app (clears in-memory state)
pkill -f streamlit
streamlit run app.py

# Or clear the LanceDB document cache
python -c "import lancedb; db = lancedb.connect('./lancedb'); db.drop_table('doc_sections')"
```

If cited sections don't auto-expand in chat:
Check:
- LLM is citing sections by number (e.g., "Section 5.12.2") or title
- Document has been parsed with enhanced parser (not URL-only)
- Section titles match citation format
Debug: Check the logs for "Referenced sections" or "No section titles detected"
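If you need to reason about why a citation was or wasn't matched, the core idea is a tolerant regex over the answer text. A simplified sketch of that idea (the matcher in the app also compares against section titles and pulls in subsections):

```python
import re

# Match citations like "Section 5.12.2", "section 7.1", or "§ 9.01" in an answer.
SECTION_REF = re.compile(r"(?:Section|§)\s*(\d+(?:\.\d+)*)", re.IGNORECASE)

answer = "Prepayments are governed by Section 2.10 and the notice rules in § 9.01(a)."
print(SECTION_REF.findall(answer))  # ['2.10', '9.01']
```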
Based on the CAG paper's experiments:
- Small contexts (3-16 docs, ~21k tokens): CAG provides 10x+ speedup over dynamic context loading
- Medium contexts (4-32 docs, ~32-43k tokens): CAG offers 17x+ speedup
- Large contexts (7-64 docs, ~50-85k tokens): CAG achieves 40x+ speedup
The precomputed KV cache eliminates the need to reprocess documents for each query, making multi-turn conversations dramatically faster.
- Document Upload: User uploads files or provides URLs
- Conversion: Docling converts documents to plain text
- Context Preloading: Documents are concatenated and passed to the LLM
- KV Cache: Ollama automatically caches the model's inference state (handled internally)
- Query Processing: User questions are appended to the cached context
- Streaming Response: The model generates answers using the preloaded knowledge
- Document Upload: User uploads files or provides URLs
- Conversion: Docling converts documents to plain text
- Context Preloading: Documents are concatenated and passed to the LLM
- KV-Cache Creation: The model's inference state is precomputed and stored
- Efficient Queries: User questions are processed using the cached context
- Streaming Response: The model generates answers using preloaded knowledge
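The conversion step above is handled by Docling. In isolation it looks roughly like the following sketch, with a placeholder file name rather than the app's wrapper in skills/pdf_parser/pdf_parser.py:

```python
# Convert a document to text with Docling, as the conversion step above does conceptually.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("credit_agreement.pdf")   # PDF/HTML/MD and other sources supported
text = result.document.export_to_markdown()          # layout-aware text for the LLM context
print(text[:500])
```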
```text
CAGVAULT ARCHITECTURE

DOCUMENT INGESTION PIPELINE

  User Documents (PDF/TXT/MD/URL)
        |
        v
  Docling Parser (skills/pdf_*)
    - Converts PDFs with layout preservation
    - OCR support (optional)
    - Table detection
        |
        v
  Enhanced Parser (LLM-powered analysis)
    - Hierarchical section extraction
    - Parallel LLM importance scoring (4 workers)
    - Credit analyst classification
    - Page-accurate tracking (word-based)
    - Named Entity Recognition (NER)
    - Cross-reference detection
        |
        v
  SectionMemoryStore (in-memory)
    - Hierarchical document structure
    - Section -> subsection relationships
    - Metadata indexing (pages, importance, type)
    - Deduplication prevention
        |
        v
  LanceDB Persistent Storage (embedded)
    Table: doc_sections
    - Hierarchical sections (parent_id, order)
    - Full-text search indexes (content, title)
    - Pre-computed keywords & entities
    - Document metadata (pages, type, size)
    In-process cache: 3s TTL DataFrame
    - Sub-millisecond reads for frequent access
    - Thread-safe with automatic invalidation

SEARCH & RETRIEVAL LAYER

  Keyword Search         Semantic Search        Agentic Search
  (FullTextSearch)       (embedding-based)      (Claude-powered)
          \                      |                      /
           +---------------------+---------------------+
                                 |
                                 v
                      Search Results
                        + relevance scores
                        + reasoning (agentic)

CHAT & Q&A LAYER

  User Question
        |
        v
  Question Library (LanceDB, table: question_library)
    - 15+ categories (Definitions, etc.)
    - Usage tracking & popularity
    - Autocomplete suggestions (FTS)
    - Per-document & global questions
    - In-process cache (3s TTL)
        |
        v
  Q&A Cache (LanceDB, table: qa_cache)
    - Key: sha256(question + doc_ids)
    - Value: {response, thinking, metadata}
    - FTS index: question field
    - Cache hit  -> return cached response
    - Cache miss -> continue to the LLM
        |
        v
  Context Builder
    - Load full document content
    - Build hierarchical context
    - Include section metadata
        |
        v
  KV-Cache Manager
    - Precompute context state
    - Track token counts
    - Deduplicate sources
    - Persistent disk storage
    - 10-40x speedup for multi-turn chat
        |
        v
  Ollama LLM Server (4 parallel workers)
    - Model: Qwen3-14B (Q4_K_XL quantized)
    - Context: 8K+ tokens
    - Temperature: 0.0 (deterministic)
    - System prompt: credit analyst expertise,
      cross-reference checking, citation requirements
        |
        v
  Response Stream
    - <think>...</think> -> reasoning
    - Answer             -> final response
    - Auto-cache to LanceDB
    - Extract section references
    - Track to question library
        |
        v
  Referenced Section Matcher
    - Regex-based title matching
    - Numeric prefix detection (5.12.2)
    - "Section"/"§" prefix variants
    - Case-insensitive matching
    - Subsection inclusion
        |
        v
  Streamlit UI Display
    - Chat messages
    - Expandable thinking blocks
    - Referenced section expanders with full content
    - Cache status indicators

DATA FLOW SUMMARY

  1. UPLOAD: PDF -> Docling -> Enhanced Parser -> section analysis (parallel)
  2. STORE:  Sections -> memory + LanceDB persistence
  3. INDEX:  Keywords + entities + semantic embeddings
  4. QUERY:  Question -> library + Q&A cache check
  5. SEARCH: Keyword/semantic/agentic -> relevant sections
  6. BUILD:  Context from sections -> KV-cache
  7. INFER:  LLM with cached context -> streamed response
  8. MATCH:  Extract section refs -> auto-expand in UI
  9. CACHE:  Store Q&A + update library + track usage
```
- Ollama: Local LLM inference server (Qwen3-14B)
- LanceDB: Embedded vector database for persistent storage (Q&A cache, sections, questions)
- Streamlit: Interactive web UI with real-time updates
- LangChain: LLM orchestration and streaming
- Docling (`skills/pdf_parser/pdf_parser.py`): PDF/HTML/TXT/MD conversion with layout preservation
- EnhancedPDFParserSkill (`skills/pdf_parser/enhanced_parser.py`):
  - LLM-powered section extraction
  - Parallel importance scoring (ThreadPoolExecutor)
  - Hierarchical structure with page tracking
  - LanceDB persistence with deduplication guards
- SectionMemoryStore: In-memory hierarchical document structure
- NamedEntityRecognizer (`skills/pdf_parser/ner_search.py`): Extract and index entities
- FullTextSearchEngine: Fast keyword search with tokenization
- Semantic Search: Embedding-based similarity matching
- Agentic Search: Claude-powered intelligent query understanding
- KVCacheManager (`kvcache.py`): Context state caching with disk persistence
- QACacheManager (`qa_cache.py`): LanceDB-backed Q&A caching with persistent storage
- QuestionLibraryManager (`question_library.py`): Question tracking with categorization and usage analytics
- LanceDBStore (`lancedb_cache.py`): Unified storage layer with in-process DataFrame cache (3s TTL)
- CreditAnalystPrompt (`skills/pdf_parser/credit_analyst_prompt.py`): Section classification and importance
- LLMSectionEvaluator (`skills/pdf_parser/llm_section_evaluator.py`): Batch analysis with parallel processing
Unified Storage Layer (lancedb_cache.py):
- Embedded Vector Database: No external server required; all data lives in the `./lancedb` directory
- Three Main Tables:
- doc_sections: Hierarchical document sections with full-text search
- qa_cache: Question-answer pairs with thinking and metadata
- question_library: Popular questions with usage tracking and categorization
Schema Design:
```text
# doc_sections table
document_id: string        # Unique document identifier
document_name: string      # Human-readable name
section_id: string         # Section unique ID
parent_id: string          # Parent section for hierarchy
level: int32               # Nesting level (1, 2, 3...)
order_idx: int32           # Preserves document order
title: string              # Section title
content: string            # Section text content
keywords: list<string>     # Pre-computed search tokens
entities_json: string      # NER results (JSON)
metadata_json: string      # Section metadata
total_pages: int32         # Document page count
extraction_method: string  # Parser version/method
source: string             # Origin (upload, URL, etc.)
stored_at: string          # Timestamp (ISO 8601)

# qa_cache table
cache_key: string          # SHA256 hash of question + doc_ids
question: string           # Original question
response: string           # LLM answer
thinking: string           # Reasoning process
doc_ids: list<string>      # Associated documents
timestamp: string          # Cache creation time
metadata_json: string      # Model, source count, etc.

# question_library table
question: string           # Unique question text (normalized)
doc_ids: list<string>      # Related documents
category: string           # Question category
usage_count: int64         # Popularity metric
is_default: bool           # Pre-seeded question
created_at: string         # Creation timestamp
metadata_json: string      # Additional metadata
```

Performance Optimizations:
- Full-Text Search (FTS) Indexes:
  - doc_sections: `content`, `title`, `document_name`
  - qa_cache: `question`
  - question_library: `question`
- In-Process DataFrame Cache (3-second TTL):
  - Caches table contents as pandas DataFrames in memory
  - Sub-millisecond reads for frequent queries
  - Thread-safe with locks
  - Automatic invalidation on writes
  - Warmed on startup for instant first access
- Write Strategy:
  - Immediate writes to LanceDB (ACID-compliant)
  - Cache invalidation triggered after a successful write
  - No blocking; operations complete quickly
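For illustration, this is roughly how FTS indexes of this kind are created and queried with the LanceDB Python API. The app builds its own indexes on startup, so this sketch is only for experimentation, and depending on your lancedb version the FTS path may require the tantivy extra:

```python
# Sketch: create a full-text-search index on the section table and query it.
import lancedb

db = lancedb.connect("./lancedb")
sections = db.open_table("doc_sections")

# Index the same fields listed above (idempotent with replace=True).
sections.create_fts_index(["content", "title", "document_name"], replace=True)

# Keyword query against the indexed columns.
hits = sections.search("change of control", query_type="fts").limit(5).to_pandas()
print(hits[["title", "document_name"]])
```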
Data Flow:
```text
Read path:
  Application request (read)
        |
        v
  Check in-process cache (3s TTL)
    - Thread-safe lock acquisition
    - Check timestamp validity
        |
   hit  |  miss
    |   +------------------------------+
    v                                  v
  Return cached DataFrame         Query LanceDB table
  (sub-millisecond)                 - Convert to pandas
                                    - Store in cache
                                    - Return DataFrame

Write path:
  Application request (write)
        |
        v
  Write to LanceDB
    - ACID transaction
    - Immediate persistence
        |
        v
  Invalidate in-process cache
    - Remove cached DataFrame
    - Next read will refresh from disk
```
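The read path above is a classic read-through cache with a time-to-live. A compact sketch of the idea (the real implementation in lancedb_cache.py adds per-table state and write-invalidation hooks):

```python
import threading
import time

class TTLTableCache:
    """Read-through cache: serve a recent DataFrame copy, refresh after `ttl` seconds."""

    def __init__(self, load_fn, ttl: float = 3.0):
        self._load_fn = load_fn          # e.g. lambda: table.to_pandas()
        self._ttl = ttl
        self._lock = threading.Lock()
        self._df = None
        self._loaded_at = 0.0

    def get(self):
        with self._lock:
            if self._df is None or time.monotonic() - self._loaded_at > self._ttl:
                self._df = self._load_fn()           # miss -> hit LanceDB and refresh
                self._loaded_at = time.monotonic()
            return self._df                          # hit -> sub-millisecond read

    def invalidate(self):
        with self._lock:
            self._df = None                          # force a refresh on the next read
```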
Migration from Redis:
- Optional one-time migration utility: `lancedb_cache.migrate_from_redis(redis_client)`
- Imports documents, Q&A cache, and question library
- Preserves all metadata and relationships
- No data loss during transition
1. KV-Cache (Context State)
- No document reprocessing: Once cached, documents aren't re-tokenized
- Multi-turn speedup: 10-40x faster for subsequent queries (from CAG paper)
- Memory efficient: Tracks token counts and cache size
- Automatic deduplication: Same documents aren't cached twice
- Persistent storage: Caches stored on disk for reuse across sessions
2. Q&A Cache (Response Level)
- Instant retrieval: Identical questions return cached answers immediately
- Document-aware: Cache keys include document IDs for precise matching
- Persistent storage: No expiration, manually managed via UI or API
- Thinking included: Caches both reasoning and final response
- Per-document management: Clear cache for specific documents
3. Document Section Cache (LanceDB)
- Parse once: Parsed sections persisted to LanceDB with FTS indexes
- Fast reload: Load document structure without re-parsing (in-process cache)
- Hierarchical storage: Maintains parent-child relationships via order_idx
- Search index: Pre-computed keywords and entities with full-text search
- Deduplication guards: Prevents repeated section additions
- In-process cache: 3-second TTL DataFrame cache for frequent reads
Concurrent Request Handling
- 4 parallel LLM workers handle simultaneous requests
- Non-blocking chat responses during document processing
- Multiple users can interact concurrently
- Configurable via `Config.OLLAMA_NUM_PARALLEL`
- 5-minute request timeout prevents hanging operations
- See CONCURRENT_REQUESTS.md for detailed configuration
Section Analysis (4 workers)
- Concurrent LLM calls for importance scoring
- Classification of section types (COVENANT, DEFAULT, etc.)
- Batch processing of subsections
- Progress logging every 10 sections
Word-Based Page Estimation
- ~250 words per page heuristic
- Instant calculation vs. slow LLM page range calls
- Accurate enough for UI display and citations
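As a sketch of the heuristic (illustrative; the parser's own estimator may round differently):

```python
WORDS_PER_PAGE = 250  # heuristic used by the parser: ~250 words per printed page

def estimate_pages(text: str) -> int:
    """Cheap page-count estimate used instead of asking the LLM for page ranges."""
    return max(1, round(len(text.split()) / WORDS_PER_PAGE))

print(estimate_pages("word " * 1600))  # -> 6 pages for 1,600 words
```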
In-Memory Section Store
- Fast lookups by section ID
- Hierarchical traversal for subsections
- Automatic memory clearing before fresh loads
- Prevents duplicate section accumulation
- Upload Full Agreement: Include all sections, schedules, and amendments
- Let Parsing Complete: Wait for parallel LLM analysis to finish (progress shown)
- Use Agentic Search: For complex queries, agentic search provides reasoning
- Check Referenced Sections: Always expand cited sections to verify context
- Review Cache: Use Q&A cache management to track analysis history
- Rely on LanceDB Persistence: Parsed sections, the Q&A cache, and the question library persist automatically in `./lancedb`, so no separate cache server is needed
- Batch Upload: Upload all related documents before starting Q&A
- Use Suggested Questions: Build question library for faster team collaboration
- Monitor Cache Stats: Clear old caches periodically to free memory
- Parallel Processing: Parser uses 4 workers by default; increase for faster analysis
- Categorize Thoughtfully: Questions are auto-categorized but review for accuracy
- Track Usage: Popular questions surface to the top automatically
- Search Before Asking: Use autocomplete to find existing answers
- Document-Specific: Filter questions by document for focused analysis
- Clear Periodically: Remove outdated questions to keep library relevant
- Related Documents: Upload contracts and amendments together
- Clear Context Cache: When switching document sets, clear cache
- Check Message Source IDs: Verify which documents are in context
- LanceDB Loading: For frequently used documents, load them from the "Documents in LanceDB" picker instead of re-parsing
- Qwen3-14B: ~8K tokens (~3-4 medium PDFs or 1 large credit agreement)
- Token Estimation: ~750 tokens per page for dense legal documents
- Workaround: Focus on specific sections or use search to find relevant parts
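The budget works out as in this quick estimate, assuming the ~750 tokens/page figure above, the default 8K window, and an illustrative 1,500-token reserve for the prompt and answer:

```python
# Rough capacity check: how many dense legal pages fit in the default context window?
context_window = 8192      # Qwen3-14B as configured (OLLAMA_CONTEXT_WINDOW)
tokens_per_page = 750      # dense legal text, per the estimate above
reserved_for_chat = 1500   # assumed headroom for the question, system prompt, and answer

usable = context_window - reserved_for_chat
print(usable // tokens_per_page)  # ~8 pages of agreement text per request
```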
- Minimum: 8GB RAM for 7B models
- Recommended: 16GB RAM for 14B models
- Caching Overhead: Persistent caches add roughly 100MB-1GB depending on document count
- Section Analysis: Uses 4 parallel workers (can adjust in code)
- Storage: LanceDB is embedded and requires no external server
- Q&A Cache: Persists in LanceDB across sessions
- Question Library: Persists in LanceDB across sessions
- Document Sections: Persist in LanceDB; memory-only use is possible but does not survive restarts
- Constantly Updating Knowledge: Traditional RAG better for dynamic data
- Very Large Corpora: 100+ documents may exceed context limits
- Real-Time Collaboration: Single-user app, not designed for teams
- Production Deployments: This is a research/analysis tool, not a production service
- Parallel LLM Section Analysis: 4 concurrent workers for faster parsing
- Credit Analyst Classification: Automatic detection of COVENANTS, DEFAULTS, etc.
- Importance Scoring: AI-driven relevance analysis (0-1 scale)
- Page-Accurate Tracking: Word-based estimation for instant page mapping
- Hierarchical Sections: Full parent-child relationships preserved
- Multi-Modal Search: Keyword, semantic, and agentic (Claude-powered)
- Named Entity Recognition: Extract PARTY, DATE, MONEY, AGREEMENT entities
- Entity Filtering: Browse by entity type across all sections
- Section References: Auto-expand cited sections in chat responses
- Q&A Cache: LanceDB-backed with persistent storage
- Question Library: 15+ categories with autocomplete
- Suggested Questions: Popular queries by document or global
- Cache Analytics: Real-time stats and management UI
- Deduplication Guards: Prevent repeated section additions
- Document Tabs: Sections, Search, Entities in organized tabs
- Cache Indicators: Visual feedback for cache hits
- Referenced Section Expanders: Click to view full cited sections
- Browse by Category: Explore questions by type
- LanceDB Document Picker: Load previously parsed documents
- Concurrent Request Handling: 4 parallel LLM workers for simultaneous requests
- Memory Management: Automatic clearing before fresh loads
- Parallel Processing: ThreadPoolExecutor for section analysis
- LanceDB Persistence: Store parsed sections with FTS indexes for instant reload
- Word-Based Estimation: Fast page calculation without LLM calls
- Connection Pooling: Optimized Ollama connections with timeout management
- Python 3.14 Support: Compatible with the latest Python
- Embedded Storage: No external database server required
- Enhanced Error Handling: Better logging and fallbacks
- Document Deduplication: Prevent duplicate button keys
If you use this project or the CAG methodology, please cite the original paper:
```bibtex
@inproceedings{chan2025cag,
  title={Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks},
  author={Chan, Brian J and Chen, Chao-Ting and Cheng, Jui-Hung and Huang, Hen-Hsen},
  booktitle={Proceedings of the ACM Web Conference 2025},
  year={2025}
}
```

MIT License - See LICENSE file for details
Contributions welcome! Please open an issue or submit a pull request.
Created by Amitabha Karmakar
For Issues or Questions:
- Check the Troubleshooting section above
- Review the Best Practices for optimal usage
- Check logs in the terminal where you ran `streamlit run app.py`
- Open a GitHub issue with:
  - Error message and full traceback
  - Python version (`python --version`)
  - Ollama status (`ollama list`)
  - LanceDB tables (`python -c "import lancedb; print(lancedb.connect('./lancedb').list_tables().tables)"`)
  - Steps to reproduce
Documentation:
- CAG Paper: https://arxiv.org/abs/2412.15605v1
- Implementation Details:
  - `documentation/AGENTIC_RAG_GUIDE.md` - NEW! Multi-step reasoning RAG system
  - `documentation/AGENT_SDK_INTEGRATION.md` - NEW! Claude Agent SDK MCP tools
  - `documentation/MCP_TOOLS_GUIDE.md` - MCP tools user guide
  - `documentation/QA_CACHE_IMPLEMENTATION.md` - Q&A caching system
  - `documentation/QUESTION_LIBRARY_IMPLEMENTATION.md` - Question library design
  - `documentation/PDF_PARSER_SKILL_SUMMARY.md` - Enhanced PDF parsing
  - `documentation/CLAUDE_SKILLS_GUIDE.md` - Claude skills integration
  - `skills/pdf_parser/ENHANCED_PARSER_GUIDE.md` - Advanced document parsing
Logs & Debugging:
```bash
# Check terminal output for detailed logs
# Logs include:
# - Section extraction progress
# - LLM analysis status
# - Cache hits/misses
# - LanceDB storage status
# - Entity extraction results

# Enable more verbose logging (if needed):
export LOG_LEVEL=DEBUG
streamlit run app.py
```

Built with ❤️ using Qwen3, Ollama, LangChain, Docling, and Streamlit