Confidently navigate LLM deployments from concept to production.
The system addresses a critical challenge: how do you translate business requirements into the right model and infrastructure choices without expensive trial-and-error?
Compass guides LLM deployments from concept to production through SLO-driven capacity planning. Define your requirements conversationally—Compass translates them into traffic profiles, performance targets, and cost constraints. Get model and GPU recommendations backed by real benchmark data. Explore alternatives, compare tradeoffs, deploy with one click, and monitor actual performance—staying on course as your needs evolve.
The code in this repository implements the Compass Phase 2 MVP with production-grade data management. Phase 1 (POC) demonstrated the end-to-end workflow with synthetic data. Phase 2 adds PostgreSQL for benchmark storage, a traffic profile framework aligned with GuideLLM standards, experience-driven SLO mapping, and p95 percentile targets for conservative guarantees.
- 🗣️ Conversational Requirements Gathering - Describe your use case in natural language
- 📊 SLO-Driven Capacity Planning - Translate business needs into technical specifications (traffic profiles, latency targets, cost constraints)
- 🎯 Intelligent Recommendations - Get optimal model + GPU configurations backed by real benchmark data
- 🔍 What-If Analysis - Explore alternatives and compare cost vs. latency tradeoffs
- ⚡ One-Click Deployment - Generate production-ready KServe/vLLM YAML and deploy to Kubernetes
- 📈 Performance Monitoring - Track actual deployment status and test inference in real-time
- 💻 GPU-Free Development - vLLM simulator enables local testing without GPU hardware
- Extract Intent - LLM-powered analysis converts your description into structured requirements
- Map to Traffic Profile - Match use case to one of 4 GuideLLM benchmark configurations
- Set SLO Targets - Auto-generate TTFT (p95), ITL (p95), and E2E (p95) targets based on experience class
- Query Benchmarks - Exact match on (model, GPU, traffic profile) from PostgreSQL database
- Filter by SLOs - Find configurations meeting all p95 latency targets
- Plan Capacity - Calculate required replicas based on throughput requirements
- Generate & Deploy - Create validated Kubernetes YAML and deploy to local or production clusters
- Monitor & Validate - Track deployment status and test inference endpoints
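The capacity-planning step above can be sketched as a simple throughput calculation. This is an illustrative sketch, not Compass's actual planner; the per-replica throughput figure is an invented placeholder.

```python
import math

def plan_replicas(peak_requests_per_sec: float,
                  output_tokens_per_request: int,
                  replica_tokens_per_sec: float) -> int:
    """Compute the replica count needed to sustain the target output-token
    throughput, rounding up and never going below one replica."""
    required_tokens_per_sec = peak_requests_per_sec * output_tokens_per_request
    return max(1, math.ceil(required_tokens_per_sec / replica_tokens_per_sec))

# Illustrative numbers: 10 req/s peak, 256 output tokens per request,
# and an assumed 1500 tok/s per replica for the chosen (model, GPU) pair.
replicas = plan_replicas(10, 256, 1500.0)
print(replicas)  # 2560 tok/s needed / 1500 tok/s per replica -> 2 replicas
```

In practice the per-replica throughput would come from the matched benchmark row rather than a constant.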
Required before running `make setup`:
- macOS or Linux (Windows via WSL2)
- Docker Desktop (must be running)
Installed automatically by `make setup`:
- Python 3.11+
- Ollama (`brew install ollama`)
- kubectl (`brew install kubectl`)
- KIND (`brew install kind`)
Get up and running in 4 commands:
```bash
make setup           # Install dependencies, pull Ollama model
make postgres-start  # Start PostgreSQL container (Phase 2)
make cluster-start   # Create local KIND cluster with vLLM simulator
make dev             # Start all services (Ollama + Backend + UI)
```

Then open http://localhost:8501 in your browser.
Note: PostgreSQL runs as a Docker container (`compass-postgres`) with benchmark data. Use `make postgres-init` to initialize the schema and `make postgres-load-synthetic` to load benchmark data.
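The exact-match benchmark lookup plus p95 SLO filter can be demonstrated end-to-end in a few lines. This sketch uses sqlite3 in place of PostgreSQL so it runs anywhere, and the table layout, column names, and numbers are simplified assumptions, not Compass's actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE benchmarks (
        model TEXT, gpu TEXT,
        prompt_tokens INTEGER, output_tokens INTEGER,
        ttft_p95_ms REAL, itl_p95_ms REAL, e2e_p95_ms REAL
    )""")
conn.executemany(
    "INSERT INTO benchmarks VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("llama-3.1-8b", "A100-80GB", 512, 256, 180.0, 12.0, 3250.0),
        ("llama-3.1-8b", "A10G",      512, 256, 420.0, 35.0, 9380.0),
    ],
)

# Exact match on the (prompt_tokens, output_tokens) traffic profile,
# then keep only configurations meeting ALL p95 SLO targets.
rows = conn.execute(
    """SELECT model, gpu FROM benchmarks
       WHERE prompt_tokens = ? AND output_tokens = ?
         AND ttft_p95_ms <= ? AND itl_p95_ms <= ? AND e2e_p95_ms <= ?""",
    (512, 256, 300.0, 20.0, 5000.0),
).fetchall()
print(rows)  # [('llama-3.1-8b', 'A100-80GB')]
```

Against the real database the same parameterized query shape would run through psycopg2.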
Stop everything:
```bash
make stop          # Stop services
make cluster-stop  # Delete cluster (optional)
```

Using Compass:

- Describe your use case in the chat interface
- Example: "I need a customer service chatbot for 5000 users with low latency"
- Review recommendations - Model, GPU configuration, SLO predictions, costs
- Edit specifications if needed (traffic, SLO targets, constraints)
- Generate deployment YAML - Click "Generate Deployment YAML"
- Deploy to cluster - Click "Deploy to Kubernetes"
- Monitor deployment - Switch to "Deployment Management" tab to see status
- Test inference - Send test prompts once deployment is Ready
The POC includes 3 pre-configured scenarios (see `data/demo_scenarios.json`):

- Customer Service Chatbot - High volume (5000 users), strict latency (<500ms)
  - Expected: Llama 3.1 8B on 2x A100-80GB
- Code Generation Assistant - Developer team (500 users), quality > speed
  - Expected: Llama 3.1 70B on 4x A100-80GB (tensor parallel)
- Document Summarization - Batch processing (2000 users/day), cost-sensitive
  - Expected: Mistral 7B on 2x A10G
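A scenario entry might be expressed along these lines; the field names below are assumptions for illustration, not the actual `data/demo_scenarios.json` schema:

```json
{
  "name": "Customer Service Chatbot",
  "prompt": "I need a customer service chatbot for 5000 users with low latency",
  "expected": { "model": "Llama 3.1 8B", "gpus": "2x A100-80GB" }
}
```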
Compass implements an 8-component architecture with:
- Conversational Interface (Streamlit) - Chat-based requirement gathering with interactive exploration
- Context & Intent Engine - LLM-powered extraction of deployment specs
- Recommendation Engine - Traffic profiling, model scoring, capacity planning
- Deployment Automation - YAML generation and Kubernetes deployment
- Knowledge Base - Benchmarks, SLO templates, model catalog
- LLM Backend - Ollama (llama3.1:8b) for conversational AI
- Orchestration - Multi-step workflow coordination
- Inference Observability - Real-time deployment monitoring
Development Tools:
- vLLM Simulator - GPU-free local development and testing
See docs/ARCHITECTURE.md for detailed system design.
- ✅ Foundation: Project structure, synthetic data, LLM client (Ollama)
- ✅ Core Recommendation Engine: Intent extraction, traffic profiling, model recommendation, capacity planning
- ✅ FastAPI Backend: REST endpoints, orchestration workflow, knowledge base access
- ✅ Streamlit UI: Chat interface, recommendation display, specification editor
- ✅ Deployment Automation: YAML generation (KServe/vLLM/HPA/ServiceMonitor), Kubernetes deployment
- ✅ Local Kubernetes: KIND cluster support, KServe installation, cluster management
- ✅ vLLM Simulator: GPU-free development mode with realistic latency simulation
- ✅ Monitoring & Testing: Real-time deployment status, inference testing UI, cluster observability
| Component | Technology |
|---|---|
| Backend | FastAPI, Pydantic |
| Frontend | Streamlit |
| LLM | Ollama (llama3.1:8b) |
| Data | PostgreSQL (Phase 2), psycopg2, JSON (Phase 1 - deprecated) |
| YAML Generation | Jinja2 templates |
| Kubernetes | KIND (local), KServe v0.13.0 |
| Deployment | kubectl, Kubernetes Python client |
```bash
make help                     # Show all available commands
make dev                      # Start all services (Ollama + Backend + UI)
make stop                     # Stop all services
make restart                  # Restart all services
make logs-backend             # Tail backend logs
make logs-ui                  # Tail UI logs

# PostgreSQL
make postgres-start           # Start PostgreSQL container
make postgres-init            # Initialize schema
make postgres-load-synthetic  # Load synthetic benchmark data
make postgres-shell           # Open PostgreSQL shell

# Kubernetes
make cluster-status           # Check Kubernetes cluster status
make clean-deployments        # Delete all InferenceServices

# Testing
make test                     # Run unit tests
make test-integration         # Run integration tests (requires Ollama)
make test-e2e                 # Run end-to-end tests (requires cluster)

make clean                    # Remove generated files
```

Compass includes a GPU-free simulator for local development:
- No GPU required - Run deployments on any laptop
- OpenAI-compatible API - `/v1/completions` and `/v1/chat/completions`
- Realistic latency - Uses benchmark data to simulate TTFT/ITL
- Fast deployment - Pods become Ready in ~10-15 seconds
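The latency simulation presumably works along these lines: delay the first token by the benchmarked TTFT, then pace each subsequent token by the ITL, so end-to-end time comes out to roughly TTFT + (n − 1) × ITL. A minimal sketch with invented timings, not the simulator's actual implementation:

```python
import time

def stream_tokens(tokens, ttft_s: float, itl_s: float):
    """Yield tokens with benchmark-derived pacing: TTFT before the first
    token, ITL between each subsequent pair of tokens."""
    for i, tok in enumerate(tokens):
        time.sleep(ttft_s if i == 0 else itl_s)
        yield tok

# Illustrative timings: 50 ms TTFT, 10 ms ITL.
start = time.monotonic()
out = list(stream_tokens(["Hello", ",", " world"], ttft_s=0.05, itl_s=0.01))
elapsed = time.monotonic() - start
print(out, round(elapsed, 2))  # e2e is at least TTFT + 2 * ITL = 0.07 s
```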
Simulator Mode (default):
```python
# In backend/src/api/routes.py
deployment_generator = DeploymentGenerator(simulator_mode=True)
```

Production Mode (requires GPU cluster):

```python
deployment_generator = DeploymentGenerator(simulator_mode=False)
```

See docs/DEVELOPER_GUIDE.md for details.
- Developer Guide - Development workflows, testing, debugging
- Architecture - Detailed system design and component specifications
- Traffic and SLOs - Traffic profile framework and experience-driven SLOs (Phase 2)
- PostgreSQL Migration Plan - Phase 2 migration details
- Architecture Diagrams - Visual system representations
- Logging Guide - Logging system and debugging
- Claude Code Guidance - AI assistant instructions for contributors
Phase 2 MVP improvements (now complete):
- ✅ PostgreSQL Database - Production-grade benchmark storage with psycopg2
- ✅ Traffic Profile Framework - 4 GuideLLM standard configurations: (512→256), (1024→1024), (4096→512), (10240→1536)
- ✅ Experience-Driven SLOs - 9 use cases mapped to 5 experience classes (instant, conversational, interactive, deferred, batch)
- ✅ p95 Percentiles - More conservative SLO guarantees (changed from p90)
- ✅ ITL Terminology - Inter-Token Latency instead of TPOT (Time Per Output Token)
- ✅ Exact Traffic Matching - No fuzzy matching, exact (prompt_tokens, output_tokens) queries
- ✅ Pre-calculated E2E - E2E latency stored in benchmarks for accuracy
- ✅ Enhanced SLO Filtering - Find configurations meeting all p95 targets
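The experience-driven SLO mapping can be pictured as a two-level lookup: use case → experience class → p95 targets. The class names below come from the list above, but the use-case mapping and the p95 numbers are invented placeholders, not Compass's actual targets.

```python
# p95 targets per experience class — placeholder values for illustration.
EXPERIENCE_CLASS_SLOS = {
    "instant":        {"ttft_p95_ms": 200,   "itl_p95_ms": 10},
    "conversational": {"ttft_p95_ms": 500,   "itl_p95_ms": 25},
    "interactive":    {"ttft_p95_ms": 1000,  "itl_p95_ms": 50},
    "deferred":       {"ttft_p95_ms": 5000,  "itl_p95_ms": 100},
    "batch":          {"ttft_p95_ms": 30000, "itl_p95_ms": 200},
}

# Hypothetical use-case → experience-class assignments.
USE_CASE_CLASS = {"chatbot": "conversational", "summarization": "batch"}

def slo_targets(use_case: str) -> dict:
    """Resolve a use case to its experience class's p95 SLO targets."""
    return EXPERIENCE_CLASS_SLOS[USE_CASE_CLASS[use_case]]

print(slo_targets("chatbot"))  # {'ttft_p95_ms': 500, 'itl_p95_ms': 25}
```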
- Production-Grade Ingress - External access with TLS, authentication, rate limiting
- Production GPU Validation - End-to-end testing with real GPU clusters
- Feedback Loop - Actual metrics → benchmark updates
- Statistical Traffic Models - Full distributions (not just point estimates)
- Multi-Dimensional Benchmarks - Concurrency, batching, KV cache effects
- Security Hardening - YAML validation, RBAC, network policies
- Multi-Tenancy - Namespaces, resource quotas, isolation
- Advanced Simulation - SimPy, Monte Carlo for what-if analysis
This project is in early development, but contributions are welcome.
See CLAUDE.md for AI assistant guidance when making changes.
This project is licensed under Apache License 2.0. See the LICENSE file for details.