Confidently navigate LLM deployments from concept to production.
The system addresses a critical challenge: how do you translate business requirements into the right model and infrastructure choices without expensive trial-and-error?
Compass guides LLM deployments from concept to production through SLO-driven capacity planning. Define your requirements conversationally—Compass translates them into traffic profiles, performance targets, and cost constraints. Get model and GPU recommendations backed by real benchmark data. Explore alternatives, compare tradeoffs, deploy with one click, and monitor actual performance—staying on course as your needs evolve.
The code in this repository implements the Compass Phase 2 MVP with production-grade data management. Phase 1 (POC) demonstrated the end-to-end workflow with synthetic data. Phase 2 adds PostgreSQL for benchmark storage, a traffic profile framework aligned with GuideLLM standards, experience-driven SLO mapping, and p95 percentile targets for conservative guarantees.
- 🗣️ Conversational Requirements Gathering - Describe your use case in natural language
- 📊 SLO-Driven Capacity Planning - Translate business needs into technical specifications (traffic profiles, latency targets, cost constraints)
- 🎯 Intelligent Recommendations - Get optimal model + GPU configurations backed by real benchmark data
- 🔍 What-If Analysis - Explore alternatives and compare cost vs. latency tradeoffs
- ⚡ One-Click Deployment - Generate production-ready KServe/vLLM YAML and deploy to Kubernetes
- 📈 Performance Monitoring - Track actual deployment status and test inference in real-time
- 💻 GPU-Free Development - vLLM simulator enables local testing without GPU hardware
- Extract Intent - LLM-powered analysis converts your description into structured requirements
- Map to Traffic Profile - Match use case to one of 4 GuideLLM benchmark configurations
- Set SLO Targets - Auto-generate TTFT (p95), ITL (p95), and E2E (p95) targets based on experience class
- Query Benchmarks - Exact match on (model, GPU, traffic profile) from PostgreSQL database
- Filter by SLOs - Find configurations meeting all p95 latency targets
- Plan Capacity - Calculate required replicas based on throughput requirements
- Generate & Deploy - Create validated Kubernetes YAML and deploy to local or production clusters
- Monitor & Validate - Track deployment status and test inference endpoints
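The capacity-planning step above can be sketched as a simple throughput calculation. This is an illustrative sketch, not Compass's actual planner; the per-replica throughput figure is an invented placeholder.

```python
import math

def plan_replicas(peak_requests_per_sec: float,
                  output_tokens_per_request: int,
                  replica_tokens_per_sec: float) -> int:
    """Compute the replica count needed to sustain the target output-token
    throughput, rounding up and never going below one replica."""
    required_tokens_per_sec = peak_requests_per_sec * output_tokens_per_request
    return max(1, math.ceil(required_tokens_per_sec / replica_tokens_per_sec))

# Illustrative numbers: 10 req/s peak, 256 output tokens per request,
# and an assumed 1500 tok/s per replica for the chosen (model, GPU) pair.
replicas = plan_replicas(10, 256, 1500.0)
print(replicas)  # 2560 tok/s needed / 1500 tok/s per replica -> 2 replicas
```

In practice the per-replica throughput would come from the matched benchmark row rather than a constant.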
Required before running `make setup`:
- macOS or Linux (Windows via WSL2)
- Docker Desktop (must be running)
Installed automatically by `make setup`:
- Python 3.11+
- Ollama (`brew install ollama`)
- kubectl (`brew install kubectl`)
- KIND (`brew install kind`)
Get up and running in 4 commands:
```bash
make setup           # Install dependencies, pull Ollama model
make postgres-start  # Start PostgreSQL container (Phase 2)
make cluster-start   # Create local KIND cluster with vLLM simulator
make dev             # Start all services (Ollama + Backend + UI)
```

Then open http://localhost:8501 in your browser.
Note: PostgreSQL runs as a Docker container (`compass-postgres`) with benchmark data. Use `make postgres-init` to initialize the schema and `make postgres-load-synthetic` to load benchmark data.
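The exact-match benchmark lookup plus p95 SLO filter can be demonstrated end-to-end in a few lines. This sketch uses sqlite3 in place of PostgreSQL so it runs anywhere, and the table layout, column names, and numbers are simplified assumptions, not Compass's actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE benchmarks (
        model TEXT, gpu TEXT,
        prompt_tokens INTEGER, output_tokens INTEGER,
        ttft_p95_ms REAL, itl_p95_ms REAL, e2e_p95_ms REAL
    )""")
conn.executemany(
    "INSERT INTO benchmarks VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("llama-3.1-8b", "A100-80GB", 512, 256, 180.0, 12.0, 3250.0),
        ("llama-3.1-8b", "A10G",      512, 256, 420.0, 35.0, 9380.0),
    ],
)

# Exact match on the (prompt_tokens, output_tokens) traffic profile,
# then keep only configurations meeting ALL p95 SLO targets.
rows = conn.execute(
    """SELECT model, gpu FROM benchmarks
       WHERE prompt_tokens = ? AND output_tokens = ?
         AND ttft_p95_ms <= ? AND itl_p95_ms <= ? AND e2e_p95_ms <= ?""",
    (512, 256, 300.0, 20.0, 5000.0),
).fetchall()
print(rows)  # [('llama-3.1-8b', 'A100-80GB')]
```

Against the real database the same parameterized query shape would run through psycopg2.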
Stop everything:
```bash
make stop          # Stop services
make cluster-stop  # Delete cluster (optional)
```

Using Compass:

- Describe your use case in the chat interface
- Example: "I need a customer service chatbot for 5000 users with low latency"
- Review recommendations - Model, GPU configuration, SLO predictions, costs
- Edit specifications if needed (traffic, SLO targets, constraints)
- Generate deployment YAML - Click "Generate Deployment YAML"
- Deploy to cluster - Click "Deploy to Kubernetes"
- Monitor deployment - Switch to "Deployment Management" tab to see status
- Test inference - Send test prompts once deployment is Ready
The POC includes 3 pre-configured scenarios (see `data/demo_scenarios.json`):

- Customer Service Chatbot - High volume (5000 users), strict latency (<500ms)
  - Expected: Llama 3.1 8B on 2x A100-80GB
- Code Generation Assistant - Developer team (500 users), quality > speed
  - Expected: Llama 3.1 70B on 4x A100-80GB (tensor parallel)
- Document Summarization - Batch processing (2000 users/day), cost-sensitive
  - Expected: Mistral 7B on 2x A10G
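A scenario entry might be expressed along these lines; the field names below are assumptions for illustration, not the actual `data/demo_scenarios.json` schema:

```json
{
  "name": "Customer Service Chatbot",
  "prompt": "I need a customer service chatbot for 5000 users with low latency",
  "expected": { "model": "Llama 3.1 8B", "gpus": "2x A100-80GB" }
}
```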
Compass implements an 8-component architecture with:
- Conversational Interface (Streamlit) - Chat-based requirement gathering with interactive exploration
- Context & Intent Engine - LLM-powered extraction of deployment specs
- Recommendation Engine - Traffic profiling, model scoring, capacity planning
- Deployment Automation - YAML generation and Kubernetes deployment
- Knowledge Base - Benchmarks, SLO templates, model catalog
- LLM Backend - Ollama (llama3.1:8b) for conversational AI
- Orchestration - Multi-step workflow coordination
- Inference Observability - Real-time deployment monitoring
Development Tools:
- vLLM Simulator - GPU-free local development and testing
See docs/ARCHITECTURE.md for detailed system design.
- ✅ Foundation: Project structure, synthetic data, LLM client (Ollama)
- ✅ Core Recommendation Engine: Intent extraction, traffic profiling, model recommendation, capacity planning
- ✅ FastAPI Backend: REST endpoints, orchestration workflow, knowledge base access
- ✅ Streamlit UI: Chat interface, recommendation display, specification editor
- ✅ Deployment Automation: YAML generation (KServe/vLLM/HPA/ServiceMonitor), Kubernetes deployment
- ✅ Local Kubernetes: KIND cluster support, KServe installation, cluster management
- ✅ vLLM Simulator: GPU-free development mode with realistic latency simulation
- ✅ Monitoring & Testing: Real-time deployment status, inference testing UI, cluster observability
| Component | Technology |
|---|---|
| Backend | FastAPI, Pydantic |
| Frontend | Streamlit |
| LLM | Ollama (llama3.1:8b) |
| Data | PostgreSQL (Phase 2), psycopg2, JSON (Phase 1 - deprecated) |
| YAML Generation | Jinja2 templates |
| Kubernetes | KIND (local), KServe v0.13.0 |
| Deployment | kubectl, Kubernetes Python client |
```bash
make help                     # Show all available commands
make dev                      # Start all services (Ollama + Backend + UI)
make stop                     # Stop all services
make restart                  # Restart all services
make logs-backend             # Tail backend logs
make logs-ui                  # Tail UI logs

# PostgreSQL
make postgres-start           # Start PostgreSQL container
make postgres-init            # Initialize schema
make postgres-load-synthetic  # Load synthetic benchmark data
make postgres-shell           # Open PostgreSQL shell

# Kubernetes
make cluster-status           # Check Kubernetes cluster status
make clean-deployments        # Delete all InferenceServices

# Testing
make test                     # Run unit tests
make test-integration         # Run integration tests (requires Ollama)
make test-e2e                 # Run end-to-end tests (requires cluster)

make clean                    # Remove generated files
```

Compass includes a GPU-free simulator for local development:
- No GPU required - Run deployments on any laptop
- OpenAI-compatible API - `/v1/completions` and `/v1/chat/completions`
- Realistic latency - Uses benchmark data to simulate TTFT/ITL
- Fast deployment - Pods become Ready in ~10-15 seconds
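The latency simulation presumably works along these lines: delay the first token by the benchmarked TTFT, then pace each subsequent token by the ITL, so end-to-end time comes out to roughly TTFT + (n − 1) × ITL. A minimal sketch with invented timings, not the simulator's actual implementation:

```python
import time

def stream_tokens(tokens, ttft_s: float, itl_s: float):
    """Yield tokens with benchmark-derived pacing: TTFT before the first
    token, ITL between each subsequent pair of tokens."""
    for i, tok in enumerate(tokens):
        time.sleep(ttft_s if i == 0 else itl_s)
        yield tok

# Illustrative timings: 50 ms TTFT, 10 ms ITL.
start = time.monotonic()
out = list(stream_tokens(["Hello", ",", " world"], ttft_s=0.05, itl_s=0.01))
elapsed = time.monotonic() - start
print(out, round(elapsed, 2))  # e2e is at least TTFT + 2 * ITL = 0.07 s
```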
Simulator Mode (default):
```python
# In backend/src/api/routes.py
deployment_generator = DeploymentGenerator(simulator_mode=True)
```

Production Mode (requires GPU cluster):

```python
deployment_generator = DeploymentGenerator(simulator_mode=False)
```

See docs/DEVELOPER_GUIDE.md for details.
- Developer Guide - Development workflows, testing, debugging
- Architecture - Detailed system design and component specifications
- Traffic and SLOs - Traffic profile framework and experience-driven SLOs (Phase 2)
- PostgreSQL Migration Plan - Phase 2 migration details
- Architecture Diagrams - Visual system representations
- Logging Guide - Logging system and debugging
- Claude Code Guidance - AI assistant instructions for contributors
Phase 2 MVP improvements (now complete):
- ✅ PostgreSQL Database - Production-grade benchmark storage with psycopg2
- ✅ Traffic Profile Framework - 4 GuideLLM standard configurations: (512→256), (1024→1024), (4096→512), (10240→1536)
- ✅ Experience-Driven SLOs - 9 use cases mapped to 5 experience classes (instant, conversational, interactive, deferred, batch)
- ✅ p95 Percentiles - More conservative SLO guarantees (changed from p90)
- ✅ ITL Terminology - Inter-Token Latency instead of TPOT (Time Per Output Token)
- ✅ Exact Traffic Matching - No fuzzy matching, exact (prompt_tokens, output_tokens) queries
- ✅ Pre-calculated E2E - E2E latency stored in benchmarks for accuracy
- ✅ Enhanced SLO Filtering - Find configurations meeting all p95 targets
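The experience-driven SLO mapping can be pictured as a two-level lookup: use case → experience class → p95 targets. The class names below come from the list above, but the use-case mapping and the p95 numbers are invented placeholders, not Compass's actual targets.

```python
# p95 targets per experience class — placeholder values for illustration.
EXPERIENCE_CLASS_SLOS = {
    "instant":        {"ttft_p95_ms": 200,   "itl_p95_ms": 10},
    "conversational": {"ttft_p95_ms": 500,   "itl_p95_ms": 25},
    "interactive":    {"ttft_p95_ms": 1000,  "itl_p95_ms": 50},
    "deferred":       {"ttft_p95_ms": 5000,  "itl_p95_ms": 100},
    "batch":          {"ttft_p95_ms": 30000, "itl_p95_ms": 200},
}

# Hypothetical use-case → experience-class assignments.
USE_CASE_CLASS = {"chatbot": "conversational", "summarization": "batch"}

def slo_targets(use_case: str) -> dict:
    """Resolve a use case to its experience class's p95 SLO targets."""
    return EXPERIENCE_CLASS_SLOS[USE_CASE_CLASS[use_case]]

print(slo_targets("chatbot"))  # {'ttft_p95_ms': 500, 'itl_p95_ms': 25}
```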
- Production-Grade Ingress - External access with TLS, authentication, rate limiting
- Production GPU Validation - End-to-end testing with real GPU clusters
- Feedback Loop - Actual metrics → benchmark updates
- Statistical Traffic Models - Full distributions (not just point estimates)
- Multi-Dimensional Benchmarks - Concurrency, batching, KV cache effects
- Security Hardening - YAML validation, RBAC, network policies
- Multi-Tenancy - Namespaces, resource quotas, isolation
- Advanced Simulation - SimPy, Monte Carlo for what-if analysis
This project is in early development, but contributions are welcome.
See CLAUDE.md for AI assistant guidance when making changes.
This project is licensed under Apache License 2.0. See the LICENSE file for details.