Skip to content

Rishikesh63/cloudcost-intelligence

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

5 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

CloudCost Intelligence ☁️

A Text-to-SQL engine that converts natural language questions into SQL queries to analyze cloud cost data from AWS and Azure.

✨ Features

  • πŸ—£οΈ Natural Language Processing - Ask questions in plain English
  • πŸ€– FREE Local LLM - Uses Ollama (no API keys, 100% private)
  • 🀝 Dynamic Agentic Clarification - Smart detection of ambiguous queries with database-driven options
  • πŸ” Intelligent SQL Validation - Auto-fix for UNION syntax, GROUP BY errors, and date filters
  • πŸ“Š Interactive Dashboard - Built with Streamlit
  • 🌐 RESTful API - FastAPI endpoint for programmatic access
  • πŸ“ˆ Auto Visualizations - Charts and graphs generated automatically
  • ☁️ Multi-Cloud Support - Analyze AWS and Azure costs together
  • πŸ“œ Query History - Track and reuse previous queries
  • πŸ”Œ Multiple Interfaces - Web UI, REST API, and CLI
  • πŸ›‘οΈ Robust Error Prevention - Multi-layer validation prevents SQL errors before execution

πŸ“‹ Prerequisites

  • Python 3.11+
  • Poetry (dependency management)
  • Ollama (optional - for AI-powered queries)

πŸš€ Quick Start

1. Install Poetry

(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | python -

2. Clone & Setup

# Clone the repository
git clone <your-repo-url>
cd CloudCost-Intelligence

# Install dependencies
poetry install

# Initialize database with sample data
poetry run python database_manager.py

3. (Optional) Install Ollama

# Download from: https://ollama.ai/download
# Then pull the model:
ollama pull llama3.2

4. Run the Application

Choose one of the three interfaces:

Option A: Web UI (Streamlit) - Recommended for Interactive Use

# Direct Poetry command
poetry run streamlit run app.py

The app will open at http://localhost:8501

Option B: REST API (FastAPI) - For Programmatic Access

# Run with Poetry
poetry run python api.py

The API will be available at:

  • Interactive Docs: http://localhost:8000/docs (Swagger UI)
  • Alternative Docs: http://localhost:8000/redoc (ReDoc)
  • API Endpoint: http://localhost:8000

Example API Usage:

# Test the API with curl
curl -X POST http://localhost:8000/query `
  -H "Content-Type: application/json" `
  -d '{\"question\": \"What is the total AWS cost?\", \"explain\": true}'

# Get database statistics
curl http://localhost:8000/stats

# Get example queries
curl http://localhost:8000/examples

Option C: Command-Line Interface (CLI)

# Interactive mode
poetry run python cli.py

# Or run directly if environment is activated
python cli.py

πŸ’‘ Example Queries

Try asking:

  • "What is the total AWS cost?"
  • "Show me top 10 services by cost"
  • "What are the costs by region?"
  • "Show me Azure costs by service"
  • "What is the average cost per service?"
  • "Show me the top 5 most expensive regions"

🀝 Agentic Clarification (Smart Query Assistant)

The Agentic Clarification feature intelligently detects ambiguous queries and asks follow-up questions with dynamically generated options from your actual database to gather missing context before execution.

🎯 What It Detects

The system automatically identifies these types of ambiguities:

  1. Missing Time Range πŸ“…

    • Query: "What is the total cost?"
    • Clarifies: Which time period? (Last 7 days, last month, YTD, etc.)
  2. Missing Cloud Provider ☁️

    • Query: "Show me EC2 costs"
    • Clarifies: AWS only, Azure only, or both?
  3. Ambiguous Region 🌍

    • Query: "Show me regional costs"
    • Clarifies with actual regions from your database: us-east-1, eu-west-1, etc. (AWS) or East US, West Europe, etc. (Azure)
    • Dynamic extraction: Regions are pulled from database in real-time, not hardcoded
  4. Ambiguous Cost Metric πŸ’°

    • Query: "Show me the cost"
    • Clarifies: Billed cost (invoice) or effective cost (with discounts)?
  5. Missing Limit in Top-N Queries πŸ”’

    • Query: "Show me top services"
    • Clarifies: Top 5, 10, 20, or 50?

πŸ”„ Dynamic Options Generation

NEW: Clarification options are now generated dynamically from your actual database:

# Regions are extracted live from database
aws_regions = ["us-east-1", "eu-west-1", "ap-southeast-2", ...]  # From actual data
azure_regions = ["East US", "West Europe", "Australia East", ...]  # From actual data

# Time ranges calculated from actual billing periods
time_ranges = [
    "Last 7 days (data from 2024-01-23 to 2024-01-30)",
    "Last 30 days (data from 2023-12-31 to 2024-01-30)",
    "This month (January 2024)",
    # ... based on your actual data
]

Benefits:

  • βœ… Always shows relevant options based on your data
  • βœ… No hardcoded values that might not exist in your database
  • βœ… Provider-specific regions (AWS vs Azure automatically separated)
  • βœ… Adapts as your data changes

πŸš€ How to Use

Standalone Testing

# Run the clarification demo
poetry run python agentic_clarification.py

This will show test cases demonstrating how the clarifier works:

πŸ“ Query: Show me EC2 cost
❓ Which time range would you like me to use?
   Options:
   - Last 7 days (last_7_days)
   - Last 30 days (last_30_days)
   - This month (this_month)
   - Year to date (year_to_date)
   Missing: time_range
βœ… Enhanced: Show me EC2 cost in the last 7 days

Programmatic Usage

from agentic_clarification import AgenticClarifier

# Initialize clarifier
clarifier = AgenticClarifier()

# Analyze a query
result = clarifier.analyze_query("What is the total cost?")

if result["needs_clarification"]:
    # Display question to user
    print(result["question"])
    
    # Show options
    for option in result["options"]:
        print(f"- {option['label']}")
    
    # User selects an option (e.g., "last_30_days")
    selected_value = "last_30_days"
    
    # Apply context to enhance query
    enhanced_query = clarifier.apply_context(
        original_query="What is the total cost?",
        context_key=result["missing_context"][0],
        context_value=selected_value
    )
    
    print(f"Enhanced query: {enhanced_query}")
    # Output: "What is the total cost? in the last 30 days"

Integration with Existing Interfaces

The clarification system can be integrated into:

Web UI (Streamlit):

# Before executing query
clarifier = AgenticClarifier()
result = clarifier.analyze_query(user_query)

if result["needs_clarification"]:
    # Show selectbox with options
    selected = st.selectbox(result["question"], 
                           [opt["label"] for opt in result["options"]])
    
    # Find selected value
    selected_value = next(opt["value"] for opt in result["options"] 
                         if opt["label"] == selected)
    
    # Enhance query
    user_query = clarifier.apply_context(user_query, 
                                         result["missing_context"][0],
                                         selected_value)

REST API (FastAPI):

@app.post("/query/clarify")
async def clarify_query(question: str):
    clarifier = AgenticClarifier()
    result = clarifier.analyze_query(question)
    return result

@app.post("/query/apply-context")
async def apply_context(query: str, context_key: str, context_value: str):
    clarifier = AgenticClarifier()
    enhanced = clarifier.apply_context(query, context_key, context_value)
    return {"enhanced_query": enhanced}

CLI:

result = clarifier.analyze_query(user_input)
if result["needs_clarification"]:
    print(f"\n{result['question']}")
    for i, opt in enumerate(result["options"], 1):
        print(f"{i}. {opt['label']}")
    choice = int(input("Select option: ")) - 1
    user_input = clarifier.apply_context(user_input, 
                                         result["missing_context"][0],
                                         result["options"][choice]["value"])

πŸ“Š Example Scenarios

Ambiguous Query Clarification Question Options
"Show me EC2 cost" Which time range would you like me to use? Last 7 days, Last 30 days, This month, YTD, etc.
"Show me the cost" Which cost metric would you like to use? Billed Cost, Effective Cost
"Show me top services" How many results would you like to see? Top 5, Top 10, Top 20, Top 50
"What are regional costs?" Which region would you like to analyze? All regions, us-east-1, eu-west-1, etc.
"Show me S3 costs" Which cloud provider would you like to analyze? AWS, Azure, Both

🧠 Smart Features

Dynamic Database Integration: The clarifier queries your actual database:

  • Extracts DISTINCT regions from aws_cost_usage and azure_cost_usage tables
  • Separates AWS regions from Azure regions automatically
  • Calculates actual date ranges from billing data
  • Returns only relevant options (no non-existent values)

Context Application: The clarifier doesn't just ask questionsβ€”it intelligently applies your selections:

  • Time ranges β†’ Added as temporal constraints
  • Providers β†’ Added as table filters
  • Limits β†’ Inserted into TOP-N clauses
  • Metrics β†’ Replaces generic "cost" with specific column
  • Regions β†’ Added as WHERE clause filters with actual region names

Query Enhancement:

Original:  "Show me top services"
Enhanced:  "Show me top 10 services in the last 30 days for AWS"

πŸŽ“ Benefits

  • βœ… Reduces Errors - Catches missing context before execution
  • βœ… Saves Time - No need to reformulate queries manually
  • βœ… Better Results - Ensures you get exactly what you're looking for
  • βœ… User-Friendly - Guides users with helpful options
  • βœ… Flexible - Works with any interface (Web, API, CLI)
  • βœ… Data-Driven - Options generated from your actual database, not hardcoded

πŸ›‘οΈ Intelligent SQL Validation & Auto-Fix

The engine includes a comprehensive multi-layer SQL validation system that automatically detects and fixes common SQL errors before execution, preventing failures and ensuring reliable results.

πŸ” What It Fixes Automatically

1. UNION Query Syntax Errors

Problem: LLMs often generate invalid UNION queries with ORDER BY before UNION ALL

Example Error:

-- ❌ INCORRECT (generated by LLM)
SELECT servicename, SUM(billedcost) FROM aws_cost_usage
GROUP BY servicename
ORDER BY total_cost DESC  -- ERROR: ORDER BY before UNION
UNION ALL
SELECT servicename, SUM(billedcost) FROM azure_cost_usage
GROUP BY servicename
LIMIT 10

Auto-Fix Applied:

-- βœ… CORRECTED (by validation system)
SELECT servicename, SUM(billedcost) FROM aws_cost_usage
GROUP BY servicename
UNION ALL
SELECT servicename, SUM(billedcost) FROM azure_cost_usage
GROUP BY servicename
ORDER BY total_cost DESC  -- Moved to end
LIMIT 10

How It Works:

  • Pattern matching detects ORDER BY/LIMIT before UNION
  • Extracts and removes them from individual SELECTs
  • Re-appends after the entire UNION query
  • Validates final SQL structure

2. Date Filter Issues with Malformed Data

Problem: Date filters fail when data has malformed timestamps

Example Error:

Query returned 0 rows - date filter may not match data format
Date values like: 00:00.0 (malformed)

Auto-Fix Applied:

-- ❌ BEFORE: Query with date filter
SELECT * FROM aws_cost_usage
WHERE billingperiodstart BETWEEN date('2024-01-01') AND date('2024-01-31')

-- βœ… AFTER: Date filter removed + warning shown
SELECT * FROM aws_cost_usage
-- Warning: Date filter removed due to malformed date data

How It Works:

  • Detects BETWEEN clauses with date() functions
  • Checks for malformed date data (patterns like 00:00.0)
  • Removes problematic date filters
  • Displays user-friendly warning message

3. GROUP BY Column Validation

Problem: Columns in GROUP BY but not in SELECT clause

Example Error:

-- ❌ INCORRECT
SELECT SUM(billedcost) as total_cost
FROM aws_cost_usage
GROUP BY servicename, regionname  -- Error: columns not in SELECT

Auto-Fix Applied:

-- βœ… CORRECTED
SELECT servicename, regionname, SUM(billedcost) as total_cost
FROM aws_cost_usage
GROUP BY servicename, regionname  -- Now columns are in SELECT

How It Works:

  • Extracts GROUP BY columns from query
  • Validates they exist in SELECT clause
  • Auto-adds missing columns to SELECT
  • Maintains aggregation functions

πŸ—οΈ Multi-Layer Validation Architecture

The validation system uses three layers of protection:

Layer 1: LLM Prompt Engineering
β”œβ”€ Explicit UNION syntax rules in prompt
β”œβ”€ SQLite-specific guidelines
└─ 15+ critical SQL generation rules

Layer 2: Auto-Fix Functions
β”œβ”€ _fix_union_syntax()
β”œβ”€ _detect_and_fix_date_issues()
β”œβ”€ _fix_group_by_columns()
└─ Regex pattern matching & SQL parsing

Layer 3: Validation & Feedback
β”œβ”€ SQL syntax validation
β”œβ”€ Error detection
└─ User-friendly warning messages

πŸ“Š Validation Statistics

Based on comprehensive testing:

Issue Type Detection Rate Auto-Fix Success Test Coverage
UNION Syntax Errors 100% 100% 5/5 tests passing
Date Filter Issues 100% 100% 3/3 tests passing
GROUP BY Errors 100% 100% 3/3 tests passing
Overall 100% 100% 11/11 tests passing

🎯 Real-World Example

User Query: "Show me top 10 services by cost for both AWS and Azure"

LLM Generated (Invalid):

SELECT servicename, SUM(billedcost) as total FROM aws_cost_usage
GROUP BY servicename ORDER BY total DESC
UNION ALL
SELECT servicename, SUM(billedcost) FROM azure_cost_usage
GROUP BY servicename
LIMIT 10

Issues Detected:

  1. ❌ ORDER BY before UNION ALL
  2. ❌ Missing column alias in second SELECT
  3. ⚠️ LIMIT positioning unclear

After Validation:

SELECT servicename, SUM(billedcost) as total FROM aws_cost_usage
GROUP BY servicename
UNION ALL
SELECT servicename, SUM(billedcost) as total FROM azure_cost_usage
GROUP BY servicename
ORDER BY total DESC
LIMIT 10

Result: βœ… Query executes successfully, returns correct data

πŸ”§ How to Use

Automatic (Default):

# Validation happens automatically in execute_natural_query()
result = engine.execute_natural_query("Show me costs for both clouds")
# SQL is validated and fixed before execution

Manual Testing:

# Test validation on specific SQL
fixed_sql = engine._validate_and_fix_sql(problematic_sql)

Test Suite:

# Run validation tests
python test_sql_validation.py   # UNION syntax tests
python test_date_between.py     # Date filter tests
python test_fixes.py             # GROUP BY tests

πŸ’‘ Prevention Best Practices

The system prevents future issues through:

  1. Enhanced LLM Prompts - 15+ explicit SQL rules for Ollama
  2. Pattern Detection - Regex patterns for common mistakes
  3. Proactive Validation - Check before execution, not after failure
  4. Clear Feedback - User-friendly warning messages
  5. Comprehensive Testing - 100% test coverage for all validation logic

πŸ“š Documentation

  • Code: See text2sql_engine.py β†’ _validate_and_fix_sql()
  • Tests: test_sql_validation.py, test_union_fix.py, test_date_between.py, test_fixes.py
  • Examples: All validation scenarios covered in test files

πŸ›οΈ JSON Query Helpers (Tag-Based Queries)

The system includes powerful JSON extraction utilities for querying tags and cost categories within your cloud billing data.

πŸ“¦ What It Does

Cloud cost data often includes JSON columns like:

  • tags: {"Environment": "Production", "Team": "Engineering", "Project": "CloudMigration"}
  • cost_categories: {"CostCenter": "IT", "Department": "R&D"}

Standard SQL makes this hard to query. Our JSON helpers make it easy!

πŸš€ Usage

from database_manager import DatabaseManager
from json_query_helpers import JSONQueryHelper

# Initialize
db = DatabaseManager()
db.connect()
helper = JSONQueryHelper(db)

# Example 1: Get costs by Environment tag
df = helper.extract_json_field('aws_cost_usage', 'tags', 'Environment')
# Returns: Production: $15,234 | Development: $8,765

# Example 2: Query specific tag value
df = helper.query_by_tag('aws_cost_usage', 'Environment', 'Production')
# Returns: Breakdown by service/region for Production only

# Example 3: Discover available tags
tags = helper.get_available_json_keys('aws_cost_usage', 'tags')
# Returns: {'Environment', 'Team', 'Project', 'Owner'}

# Example 4: Natural language detection
result = helper.detect_tag_query("Show me costs by environment tag")
# Returns: {'is_tag_query': True, 'tag_key': 'Environment'}

πŸ“Š Supported Queries

Query Type Example Result
By Tag "Show cost by environment tag" Costs grouped by Environment values
Specific Value "Show production environment costs" Only Production tagged resources
By Cost Category "Show costs by cost center" Grouped by CostCenter category
Service Breakdown "EC2 costs by team tag" EC2 costs split by Team

πŸ› οΈ Available Methods

  • extract_json_field() - Extract and aggregate by JSON key
  • query_by_tag() - Filter and group by tag values
  • query_by_cost_category() - Filter by cost category
  • get_available_json_keys() - Discover available JSON keys
  • generate_tag_query_sql() - Generate SQL for tag queries
  • detect_tag_query() - Detect tag queries from natural language

✨ Integration with Text2SQL

The JSON helpers automatically integrate with the Text2SQL engine:

User: "Show me costs by environment tag"
      ↓
Text2SQL detects tag query
      ↓
Uses JSONQueryHelper.generate_tag_query_sql()
      ↓
Executes: SELECT json_extract(tags, '$.Environment'), SUM(billed_cost)...
      ↓
Returns: Nice formatted results by Environment

🎯 How It Works

  1. You ask a question in natural language
  2. Engine analyzes your intent and requirements
  3. SQL is generated automatically (with Ollama or rule-based)
  4. Results displayed in tables and charts

🌐 API Endpoints (FastAPI)

POST /query

Execute a natural language query

Request:

{
  "question": "What is the total AWS cost?",
  "explain": true
}

Response:

{
  "natural_query": "What is the total AWS cost?",
  "sql_query": "SELECT SUM(billedcost) as total_cost FROM aws_cost",
  "method": "Ollama",
  "row_count": 1,
  "results": [{"total_cost": 45678.92}],
  "explanation": "Based on your question 'What is the total AWS cost?', the total comes to $45,678.92."
}

GET /stats

Get database statistics for AWS and Azure

Response:

{
  "aws": {"records": 1000, "total_cost": 45678.92},
  "azure": {"records": 5000, "total_cost": 125432.18},
  "combined": {"records": 6000, "total_cost": 171111.10}
}

GET /examples

Get example queries organized by category

GET /

API health check and endpoint documentation

πŸ› οΈ Common Poetry Commands

poetry install              # Install all dependencies
poetry add package-name     # Add new package
poetry remove package-name  # Remove package
poetry update              # Update all packages
poetry show                # List installed packages

πŸ“ Project Structure

CloudCost-Intelligence/
β”œβ”€β”€ app.py                              # Streamlit web application
β”œβ”€β”€ api.py                              # FastAPI REST API endpoint
β”œβ”€β”€ cli.py                              # Command-line interface
β”œβ”€β”€ agentic_clarification.py            # Smart query clarification with dynamic options
β”œβ”€β”€ json_query_helpers.py               # JSON field extraction (tags, cost categories)
β”œβ”€β”€ database_manager.py                 # Database operations & data loading
β”œβ”€β”€ semantic_metadata.py                # Schema metadata & semantic mappings
β”œβ”€β”€ text2sql_engine.py                  # Text-to-SQL converter with validation
β”œβ”€β”€ evaluation.md                       # Test cases and results
β”œβ”€β”€ METADATA_EXTRACTION.md              # Technical documentation
β”œβ”€β”€ pyproject.toml                      # Poetry configuration
β”œβ”€β”€ README.md                           # This file
β”œβ”€β”€ test_suite.py                       # Comprehensive unit tests
β”œβ”€β”€ test_sql_validation.py              # SQL validation tests (UNION fix)
β”œβ”€β”€ test_union_fix.py                   # UNION syntax correction tests
β”œβ”€β”€ test_date_between.py                # Date filter removal tests
β”œβ”€β”€ test_fixes.py                       # GROUP BY validation tests
β”œβ”€β”€ test_dynamic_regions.py             # Dynamic region extraction tests
β”œβ”€β”€ test_dynamic_clarification_flow.py  # Clarification flow integration tests
└── mock_data_sets/
    β”œβ”€β”€ aws_cost_usage.csv              # Sample AWS data
    └── azure_cost_usage.csv            # Sample Azure data

πŸ—οΈ Architecture

Core Components:

  • database_manager.py - SQLite operations and data loading
  • semantic_metadata.py - Schema understanding and aliases
  • text2sql_engine.py - Natural language to SQL with intelligent validation
  • agentic_clarification.py - Smart query ambiguity detection with dynamic options
  • json_query_helpers.py - JSON extraction for tags and cost categories

User Interfaces:

  • app.py - Streamlit web interface with visualizations
  • api.py - FastAPI REST API for programmatic access
  • cli.py - Command-line interface for terminal use

Validation & Testing:

  • test_suite.py - Unit tests for core components
  • test_sql_validation.py - UNION syntax validation tests
  • test_union_fix.py - UNION query correction tests
  • test_date_between.py - Date filter handling tests
  • test_fixes.py - GROUP BY validation tests
  • test_dynamic_regions.py - Dynamic region extraction tests
  • test_dynamic_clarification_flow.py - Clarification integration tests

Database Schema:

  • aws_cost - AWS billing data (billedcost, servicename, regionname, etc.)
  • azure_cost - Azure billing data (same structure)

Access Methods:

  • Web UI (Streamlit) - Interactive dashboard at http://localhost:8501
  • REST API (FastAPI) - RESTful endpoints at http://localhost:8000
  • CLI - Terminal-based queries

βš™οΈ Configuration

Environment variables (optional):

  • OLLAMA_BASE_URL - Ollama server (default: http://localhost:11434/v1)
  • OLLAMA_MODEL - Model name (default: llama3.2)
  • DATABASE_PATH - Database location (default: ./cloud_cost.db)
  • DEBUG - Debug mode (default: False)

πŸ”§ Troubleshooting

App doesn't start?

  • Make sure Poetry is installed: poetry --version
  • Reinstall dependencies: poetry install
  • Try running: poetry run streamlit run app.py

Database not found?

  • Initialize it: poetry run python database_manager.py

Ollama not working?

  • The app works without Ollama (uses rule-based mode)
  • To use Ollama: Install from https://ollama.ai and run ollama pull llama3.2

🎯 Tips

For Web UI (Streamlit):

  • Click "Use Example Query" button to quickly try sample questions
  • Results are automatically visualized with charts when appropriate
  • Download results as CSV using the download button
  • Query history keeps track of your last 10 queries

For REST API (FastAPI):

  • Visit /docs for interactive Swagger documentation
  • Set "explain": true in requests to get natural language explanations
  • Use /stats to check database health
  • Use /examples to discover query patterns

For CLI:

  • Type help to see available commands
  • Type examples to see sample queries
  • Type stats to view database statistics
  • Press Ctrl+C to exit

πŸ“š Documentation

  • README.md (this file) - Setup and usage guide
  • evaluation.md - Test cases with real query examples
  • METADATA_EXTRACTION.md - Technical documentation on metadata extraction methodology

πŸŽ“ Use Cases

Web UI (Streamlit) - Best for:

  • Interactive data exploration
  • Visual analysis with charts
  • Non-technical users
  • Ad-hoc queries

REST API (FastAPI) - Best for:

  • Integration with other applications
  • Automation and scripting
  • Building custom dashboards
  • Microservices architecture

CLI - Best for:

  • Server environments without GUI
  • Shell scripting and automation
  • Quick one-off queries
  • SSH remote access

Built with: Python 3.11 | Poetry | Streamlit | FastAPI | Ollama | SQLite | Plotly

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages