A Text-to-SQL engine that converts natural language questions into SQL queries to analyze cloud cost data from AWS and Azure.
- π£οΈ Natural Language Processing - Ask questions in plain English
- π€ FREE Local LLM - Uses Ollama (no API keys, 100% private)
- π€ Dynamic Agentic Clarification - Smart detection of ambiguous queries with database-driven options
- π Intelligent SQL Validation - Auto-fix for UNION syntax, GROUP BY errors, and date filters
- π Interactive Dashboard - Built with Streamlit
- π RESTful API - FastAPI endpoint for programmatic access
- π Auto Visualizations - Charts and graphs generated automatically
- βοΈ Multi-Cloud Support - Analyze AWS and Azure costs together
- π Query History - Track and reuse previous queries
- π Multiple Interfaces - Web UI, REST API, and CLI
- π‘οΈ Robust Error Prevention - Multi-layer validation prevents SQL errors before execution
- Python 3.11+
- Poetry (dependency management)
- Ollama (optional - for AI-powered queries)
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | python -# Clone the repository
git clone <your-repo-url>
cd CloudCost-Intelligence
# Install dependencies
poetry install
# Initialize database with sample data
poetry run python database_manager.py# Download from: https://ollama.ai/download
# Then pull the model:
ollama pull llama3.2Choose one of the three interfaces:
# Direct Poetry command
poetry run streamlit run app.pyThe app will open at http://localhost:8501
# Run with Poetry
poetry run python api.pyThe API will be available at:
- Interactive Docs:
http://localhost:8000/docs(Swagger UI) - Alternative Docs:
http://localhost:8000/redoc(ReDoc) - API Endpoint:
http://localhost:8000
Example API Usage:
# Test the API with curl
curl -X POST http://localhost:8000/query `
-H "Content-Type: application/json" `
-d '{\"question\": \"What is the total AWS cost?\", \"explain\": true}'
# Get database statistics
curl http://localhost:8000/stats
# Get example queries
curl http://localhost:8000/examples# Interactive mode
poetry run python cli.py
# Or run directly if environment is activated
python cli.pyTry asking:
- "What is the total AWS cost?"
- "Show me top 10 services by cost"
- "What are the costs by region?"
- "Show me Azure costs by service"
- "What is the average cost per service?"
- "Show me the top 5 most expensive regions"
The Agentic Clarification feature intelligently detects ambiguous queries and asks follow-up questions with dynamically generated options from your actual database to gather missing context before execution.
The system automatically identifies these types of ambiguities:
-
Missing Time Range π
- Query: "What is the total cost?"
- Clarifies: Which time period? (Last 7 days, last month, YTD, etc.)
-
Missing Cloud Provider βοΈ
- Query: "Show me EC2 costs"
- Clarifies: AWS only, Azure only, or both?
-
Ambiguous Region π
- Query: "Show me regional costs"
- Clarifies with actual regions from your database: us-east-1, eu-west-1, etc. (AWS) or East US, West Europe, etc. (Azure)
- Dynamic extraction: Regions are pulled from database in real-time, not hardcoded
-
Ambiguous Cost Metric π°
- Query: "Show me the cost"
- Clarifies: Billed cost (invoice) or effective cost (with discounts)?
-
Missing Limit in Top-N Queries π’
- Query: "Show me top services"
- Clarifies: Top 5, 10, 20, or 50?
NEW: Clarification options are now generated dynamically from your actual database:
# Regions are extracted live from database
aws_regions = ["us-east-1", "eu-west-1", "ap-southeast-2", ...] # From actual data
azure_regions = ["East US", "West Europe", "Australia East", ...] # From actual data
# Time ranges calculated from actual billing periods
time_ranges = [
"Last 7 days (data from 2024-01-23 to 2024-01-30)",
"Last 30 days (data from 2023-12-31 to 2024-01-30)",
"This month (January 2024)",
# ... based on your actual data
]Benefits:
- β Always shows relevant options based on your data
- β No hardcoded values that might not exist in your database
- β Provider-specific regions (AWS vs Azure automatically separated)
- β Adapts as your data changes
# Run the clarification demo
poetry run python agentic_clarification.pyThis will show test cases demonstrating how the clarifier works:
π Query: Show me EC2 cost
β Which time range would you like me to use?
Options:
- Last 7 days (last_7_days)
- Last 30 days (last_30_days)
- This month (this_month)
- Year to date (year_to_date)
Missing: time_range
β
Enhanced: Show me EC2 cost in the last 7 days
from agentic_clarification import AgenticClarifier
# Initialize clarifier
clarifier = AgenticClarifier()
# Analyze a query
result = clarifier.analyze_query("What is the total cost?")
if result["needs_clarification"]:
# Display question to user
print(result["question"])
# Show options
for option in result["options"]:
print(f"- {option['label']}")
# User selects an option (e.g., "last_30_days")
selected_value = "last_30_days"
# Apply context to enhance query
enhanced_query = clarifier.apply_context(
original_query="What is the total cost?",
context_key=result["missing_context"][0],
context_value=selected_value
)
print(f"Enhanced query: {enhanced_query}")
# Output: "What is the total cost? in the last 30 days"The clarification system can be integrated into:
Web UI (Streamlit):
# Before executing query
clarifier = AgenticClarifier()
result = clarifier.analyze_query(user_query)
if result["needs_clarification"]:
# Show selectbox with options
selected = st.selectbox(result["question"],
[opt["label"] for opt in result["options"]])
# Find selected value
selected_value = next(opt["value"] for opt in result["options"]
if opt["label"] == selected)
# Enhance query
user_query = clarifier.apply_context(user_query,
result["missing_context"][0],
selected_value)REST API (FastAPI):
@app.post("/query/clarify")
async def clarify_query(question: str):
clarifier = AgenticClarifier()
result = clarifier.analyze_query(question)
return result
@app.post("/query/apply-context")
async def apply_context(query: str, context_key: str, context_value: str):
clarifier = AgenticClarifier()
enhanced = clarifier.apply_context(query, context_key, context_value)
return {"enhanced_query": enhanced}CLI:
result = clarifier.analyze_query(user_input)
if result["needs_clarification"]:
print(f"\n{result['question']}")
for i, opt in enumerate(result["options"], 1):
print(f"{i}. {opt['label']}")
choice = int(input("Select option: ")) - 1
user_input = clarifier.apply_context(user_input,
result["missing_context"][0],
result["options"][choice]["value"])| Ambiguous Query | Clarification Question | Options |
|---|---|---|
| "Show me EC2 cost" | Which time range would you like me to use? | Last 7 days, Last 30 days, This month, YTD, etc. |
| "Show me the cost" | Which cost metric would you like to use? | Billed Cost, Effective Cost |
| "Show me top services" | How many results would you like to see? | Top 5, Top 10, Top 20, Top 50 |
| "What are regional costs?" | Which region would you like to analyze? | All regions, us-east-1, eu-west-1, etc. |
| "Show me S3 costs" | Which cloud provider would you like to analyze? | AWS, Azure, Both |
Dynamic Database Integration: The clarifier queries your actual database:
- Extracts DISTINCT regions from aws_cost_usage and azure_cost_usage tables
- Separates AWS regions from Azure regions automatically
- Calculates actual date ranges from billing data
- Returns only relevant options (no non-existent values)
Context Application: The clarifier doesn't just ask questionsβit intelligently applies your selections:
- Time ranges β Added as temporal constraints
- Providers β Added as table filters
- Limits β Inserted into TOP-N clauses
- Metrics β Replaces generic "cost" with specific column
- Regions β Added as WHERE clause filters with actual region names
Query Enhancement:
Original: "Show me top services"
Enhanced: "Show me top 10 services in the last 30 days for AWS"
- β Reduces Errors - Catches missing context before execution
- β Saves Time - No need to reformulate queries manually
- β Better Results - Ensures you get exactly what you're looking for
- β User-Friendly - Guides users with helpful options
- β Flexible - Works with any interface (Web, API, CLI)
- β Data-Driven - Options generated from your actual database, not hardcoded
The engine includes a comprehensive multi-layer SQL validation system that automatically detects and fixes common SQL errors before execution, preventing failures and ensuring reliable results.
Problem: LLMs often generate invalid UNION queries with ORDER BY before UNION ALL
Example Error:
-- β INCORRECT (generated by LLM)
SELECT servicename, SUM(billedcost) FROM aws_cost_usage
GROUP BY servicename
ORDER BY total_cost DESC -- ERROR: ORDER BY before UNION
UNION ALL
SELECT servicename, SUM(billedcost) FROM azure_cost_usage
GROUP BY servicename
LIMIT 10Auto-Fix Applied:
-- β
CORRECTED (by validation system)
SELECT servicename, SUM(billedcost) FROM aws_cost_usage
GROUP BY servicename
UNION ALL
SELECT servicename, SUM(billedcost) FROM azure_cost_usage
GROUP BY servicename
ORDER BY total_cost DESC -- Moved to end
LIMIT 10How It Works:
- Pattern matching detects ORDER BY/LIMIT before UNION
- Extracts and removes them from individual SELECTs
- Re-appends after the entire UNION query
- Validates final SQL structure
Problem: Date filters fail when data has malformed timestamps
Example Error:
Query returned 0 rows - date filter may not match data format
Date values like: 00:00.0 (malformed)
Auto-Fix Applied:
-- β BEFORE: Query with date filter
SELECT * FROM aws_cost_usage
WHERE billingperiodstart BETWEEN date('2024-01-01') AND date('2024-01-31')
-- β
AFTER: Date filter removed + warning shown
SELECT * FROM aws_cost_usage
-- Warning: Date filter removed due to malformed date dataHow It Works:
- Detects BETWEEN clauses with date() functions
- Checks for malformed date data (patterns like
00:00.0) - Removes problematic date filters
- Displays user-friendly warning message
Problem: Columns in GROUP BY but not in SELECT clause
Example Error:
-- β INCORRECT
SELECT SUM(billedcost) as total_cost
FROM aws_cost_usage
GROUP BY servicename, regionname -- Error: columns not in SELECTAuto-Fix Applied:
-- β
CORRECTED
SELECT servicename, regionname, SUM(billedcost) as total_cost
FROM aws_cost_usage
GROUP BY servicename, regionname -- Now columns are in SELECTHow It Works:
- Extracts GROUP BY columns from query
- Validates they exist in SELECT clause
- Auto-adds missing columns to SELECT
- Maintains aggregation functions
The validation system uses three layers of protection:
Layer 1: LLM Prompt Engineering
ββ Explicit UNION syntax rules in prompt
ββ SQLite-specific guidelines
ββ 15+ critical SQL generation rules
Layer 2: Auto-Fix Functions
ββ _fix_union_syntax()
ββ _detect_and_fix_date_issues()
ββ _fix_group_by_columns()
ββ Regex pattern matching & SQL parsing
Layer 3: Validation & Feedback
ββ SQL syntax validation
ββ Error detection
ββ User-friendly warning messages
Based on comprehensive testing:
| Issue Type | Detection Rate | Auto-Fix Success | Test Coverage |
|---|---|---|---|
| UNION Syntax Errors | 100% | 100% | 5/5 tests passing |
| Date Filter Issues | 100% | 100% | 3/3 tests passing |
| GROUP BY Errors | 100% | 100% | 3/3 tests passing |
| Overall | 100% | 100% | 11/11 tests passing |
User Query: "Show me top 10 services by cost for both AWS and Azure"
LLM Generated (Invalid):
SELECT servicename, SUM(billedcost) as total FROM aws_cost_usage
GROUP BY servicename ORDER BY total DESC
UNION ALL
SELECT servicename, SUM(billedcost) FROM azure_cost_usage
GROUP BY servicename
LIMIT 10Issues Detected:
- β ORDER BY before UNION ALL
- β Missing column alias in second SELECT
β οΈ LIMIT positioning unclear
After Validation:
SELECT servicename, SUM(billedcost) as total FROM aws_cost_usage
GROUP BY servicename
UNION ALL
SELECT servicename, SUM(billedcost) as total FROM azure_cost_usage
GROUP BY servicename
ORDER BY total DESC
LIMIT 10Result: β Query executes successfully, returns correct data
Automatic (Default):
# Validation happens automatically in execute_natural_query()
result = engine.execute_natural_query("Show me costs for both clouds")
# SQL is validated and fixed before executionManual Testing:
# Test validation on specific SQL
fixed_sql = engine._validate_and_fix_sql(problematic_sql)Test Suite:
# Run validation tests
python test_sql_validation.py # UNION syntax tests
python test_date_between.py # Date filter tests
python test_fixes.py # GROUP BY testsThe system prevents future issues through:
- Enhanced LLM Prompts - 15+ explicit SQL rules for Ollama
- Pattern Detection - Regex patterns for common mistakes
- Proactive Validation - Check before execution, not after failure
- Clear Feedback - User-friendly warning messages
- Comprehensive Testing - 100% test coverage for all validation logic
- Code: See
text2sql_engine.pyβ_validate_and_fix_sql() - Tests:
test_sql_validation.py,test_union_fix.py,test_date_between.py,test_fixes.py - Examples: All validation scenarios covered in test files
The system includes powerful JSON extraction utilities for querying tags and cost categories within your cloud billing data.
Cloud cost data often includes JSON columns like:
tags:{"Environment": "Production", "Team": "Engineering", "Project": "CloudMigration"}cost_categories:{"CostCenter": "IT", "Department": "R&D"}
Standard SQL makes this hard to query. Our JSON helpers make it easy!
from database_manager import DatabaseManager
from json_query_helpers import JSONQueryHelper
# Initialize
db = DatabaseManager()
db.connect()
helper = JSONQueryHelper(db)
# Example 1: Get costs by Environment tag
df = helper.extract_json_field('aws_cost_usage', 'tags', 'Environment')
# Returns: Production: $15,234 | Development: $8,765
# Example 2: Query specific tag value
df = helper.query_by_tag('aws_cost_usage', 'Environment', 'Production')
# Returns: Breakdown by service/region for Production only
# Example 3: Discover available tags
tags = helper.get_available_json_keys('aws_cost_usage', 'tags')
# Returns: {'Environment', 'Team', 'Project', 'Owner'}
# Example 4: Natural language detection
result = helper.detect_tag_query("Show me costs by environment tag")
# Returns: {'is_tag_query': True, 'tag_key': 'Environment'}| Query Type | Example | Result |
|---|---|---|
| By Tag | "Show cost by environment tag" | Costs grouped by Environment values |
| Specific Value | "Show production environment costs" | Only Production tagged resources |
| By Cost Category | "Show costs by cost center" | Grouped by CostCenter category |
| Service Breakdown | "EC2 costs by team tag" | EC2 costs split by Team |
extract_json_field()- Extract and aggregate by JSON keyquery_by_tag()- Filter and group by tag valuesquery_by_cost_category()- Filter by cost categoryget_available_json_keys()- Discover available JSON keysgenerate_tag_query_sql()- Generate SQL for tag queriesdetect_tag_query()- Detect tag queries from natural language
The JSON helpers automatically integrate with the Text2SQL engine:
User: "Show me costs by environment tag"
β
Text2SQL detects tag query
β
Uses JSONQueryHelper.generate_tag_query_sql()
β
Executes: SELECT json_extract(tags, '$.Environment'), SUM(billed_cost)...
β
Returns: Nice formatted results by Environment
- You ask a question in natural language
- Engine analyzes your intent and requirements
- SQL is generated automatically (with Ollama or rule-based)
- Results displayed in tables and charts
Execute a natural language query
Request:
{
"question": "What is the total AWS cost?",
"explain": true
}Response:
{
"natural_query": "What is the total AWS cost?",
"sql_query": "SELECT SUM(billedcost) as total_cost FROM aws_cost",
"method": "Ollama",
"row_count": 1,
"results": [{"total_cost": 45678.92}],
"explanation": "Based on your question 'What is the total AWS cost?', the total comes to $45,678.92."
}Get database statistics for AWS and Azure
Response:
{
"aws": {"records": 1000, "total_cost": 45678.92},
"azure": {"records": 5000, "total_cost": 125432.18},
"combined": {"records": 6000, "total_cost": 171111.10}
}Get example queries organized by category
API health check and endpoint documentation
poetry install # Install all dependencies
poetry add package-name # Add new package
poetry remove package-name # Remove package
poetry update # Update all packages
poetry show # List installed packagesCloudCost-Intelligence/
βββ app.py # Streamlit web application
βββ api.py # FastAPI REST API endpoint
βββ cli.py # Command-line interface
βββ agentic_clarification.py # Smart query clarification with dynamic options
βββ json_query_helpers.py # JSON field extraction (tags, cost categories)
βββ database_manager.py # Database operations & data loading
βββ semantic_metadata.py # Schema metadata & semantic mappings
βββ text2sql_engine.py # Text-to-SQL converter with validation
βββ evaluation.md # Test cases and results
βββ METADATA_EXTRACTION.md # Technical documentation
βββ pyproject.toml # Poetry configuration
βββ README.md # This file
βββ test_suite.py # Comprehensive unit tests
βββ test_sql_validation.py # SQL validation tests (UNION fix)
βββ test_union_fix.py # UNION syntax correction tests
βββ test_date_between.py # Date filter removal tests
βββ test_fixes.py # GROUP BY validation tests
βββ test_dynamic_regions.py # Dynamic region extraction tests
βββ test_dynamic_clarification_flow.py # Clarification flow integration tests
βββ mock_data_sets/
βββ aws_cost_usage.csv # Sample AWS data
βββ azure_cost_usage.csv # Sample Azure data
Core Components:
database_manager.py- SQLite operations and data loadingsemantic_metadata.py- Schema understanding and aliasestext2sql_engine.py- Natural language to SQL with intelligent validationagentic_clarification.py- Smart query ambiguity detection with dynamic optionsjson_query_helpers.py- JSON extraction for tags and cost categories
User Interfaces:
app.py- Streamlit web interface with visualizationsapi.py- FastAPI REST API for programmatic accesscli.py- Command-line interface for terminal use
Validation & Testing:
test_suite.py- Unit tests for core componentstest_sql_validation.py- UNION syntax validation teststest_union_fix.py- UNION query correction teststest_date_between.py- Date filter handling teststest_fixes.py- GROUP BY validation teststest_dynamic_regions.py- Dynamic region extraction teststest_dynamic_clarification_flow.py- Clarification integration tests
Database Schema:
aws_cost- AWS billing data (billedcost, servicename, regionname, etc.)azure_cost- Azure billing data (same structure)
Access Methods:
- Web UI (Streamlit) - Interactive dashboard at
http://localhost:8501 - REST API (FastAPI) - RESTful endpoints at
http://localhost:8000 - CLI - Terminal-based queries
Environment variables (optional):
OLLAMA_BASE_URL- Ollama server (default:http://localhost:11434/v1)OLLAMA_MODEL- Model name (default:llama3.2)DATABASE_PATH- Database location (default:./cloud_cost.db)DEBUG- Debug mode (default:False)
App doesn't start?
- Make sure Poetry is installed:
poetry --version - Reinstall dependencies:
poetry install - Try running:
poetry run streamlit run app.py
Database not found?
- Initialize it:
poetry run python database_manager.py
Ollama not working?
- The app works without Ollama (uses rule-based mode)
- To use Ollama: Install from https://ollama.ai and run
ollama pull llama3.2
For Web UI (Streamlit):
- Click "Use Example Query" button to quickly try sample questions
- Results are automatically visualized with charts when appropriate
- Download results as CSV using the download button
- Query history keeps track of your last 10 queries
For REST API (FastAPI):
- Visit
/docsfor interactive Swagger documentation - Set
"explain": truein requests to get natural language explanations - Use
/statsto check database health - Use
/examplesto discover query patterns
For CLI:
- Type
helpto see available commands - Type
examplesto see sample queries - Type
statsto view database statistics - Press Ctrl+C to exit
- README.md (this file) - Setup and usage guide
- evaluation.md - Test cases with real query examples
- METADATA_EXTRACTION.md - Technical documentation on metadata extraction methodology
Web UI (Streamlit) - Best for:
- Interactive data exploration
- Visual analysis with charts
- Non-technical users
- Ad-hoc queries
REST API (FastAPI) - Best for:
- Integration with other applications
- Automation and scripting
- Building custom dashboards
- Microservices architecture
CLI - Best for:
- Server environments without GUI
- Shell scripting and automation
- Quick one-off queries
- SSH remote access
Built with: Python 3.11 | Poetry | Streamlit | FastAPI | Ollama | SQLite | Plotly