Single-author course project (Georgia Tech DBSE '25) – Haoyu Gao
DBIROptimizer is a lightweight research prototype that fuses classic database rewrites with modern compiler techniques. It parses SQL, applies logical/physical optimizations across a three-tier IR, and finally JIT-compiles LLVM IR for execution. The goal is to study how far a single developer can push IR-centric query compilation and what performance it can yield over an interpreted baseline (SQLite).
DBIROptimizer implements a three-tier IR approach to bridge high-level relational transformations with low-level code generation:
| Layer | Role | Notable Passes |
|---|---|---|
| HIR (logical) | relational algebra DAG | predicate push-down · constant folding |
| MIR (physical) | concrete operators & data-flow | heuristic join reorder · filter-project fusion · CSE |
| LIR (codegen) | LLVM IR for each operator | register reuse · auto-inlined expressions |
A shallow forward data-flow analysis tracks expression liveness; a redundancy pass eliminates duplicate sub-expressions. SQLite is reused only for table storage; all heavy logic executes in compiled native code.
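To make the tiering concrete, here is a minimal C++ sketch of an HIR node and its lowering to a MIR node. The type and function names are illustrative assumptions; the project's actual definitions live in ir.h and differ in detail.

```cpp
// Minimal sketch of the HIR/MIR node shape, for illustration only.
#include <memory>
#include <string>
#include <vector>

enum class HirOp { Scan, Filter, Project, Join, GroupBy, Aggregate, Select };

struct HirNode {
    HirOp op;
    std::vector<std::shared_ptr<HirNode>> children;  // shared children make it a DAG
    std::string detail;  // e.g., a predicate or column list, kept abstract here
};

enum class MirOp { Scan, Filter, Project, FilterProject, HashJoin,
                   NestedLoopJoin, HashAggregate };

struct MirNode {
    MirOp op;
    std::vector<std::shared_ptr<MirNode>> children;
    std::string detail;
};

// Lowering picks a physical operator for each logical one; this placeholder
// always chooses HashJoin for joins, whereas the real optimizer decides
// between HashJoin and NestedLoopJoin.
std::shared_ptr<MirNode> lower(const std::shared_ptr<HirNode>& h) {
    auto m = std::make_shared<MirNode>();
    switch (h->op) {
        case HirOp::Scan:   m->op = MirOp::Scan;     break;
        case HirOp::Filter: m->op = MirOp::Filter;   break;
        case HirOp::Join:   m->op = MirOp::HashJoin; break;
        default:            m->op = MirOp::Project;  break;
    }
    m->detail = h->detail;
    for (auto& c : h->children) m->children.push_back(lower(c));
    return m;
}
```

Sharing children via `shared_ptr` is what makes the plan a DAG rather than a tree: two operators can consume the same sub-plan.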
```
DBIROptimizer/
├─ ir.h               # IR definitions for HIR and MIR
├─ optimizer.h        # Optimization pass declarations
├─ optimizer.cpp      # Optimization implementations
├─ codegen.h          # LLVM code generation interface
├─ codegen.cpp        # LLVM code generation implementation
├─ dbir.cpp           # Main program with parser and SQLite integration
├─ dbir_interface.py  # Python interface for easier use
├─ CMakeLists.txt     # Build configuration
└─ README.md          # Project documentation
```
HIR (logical layer):
- Represents the logical query plan as a directed acyclic graph (DAG)
- Node types: Select, Project, Join, Filter, Scan, GroupBy, Aggregate
- Optimization passes:
  - Predicate pushdown: moves filter predicates closer to data sources (sketched after this list)
  - Constant folding: evaluates constant expressions at compile time
  - Dataflow analysis: identifies and eliminates dead (unused) expressions
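As an illustration of the pushdown pass, the sketch below moves a Filter beneath a Join when the predicate touches only the join's left input. This is a simplified stand-in, not the code in optimizer.cpp; the `Plan` struct and field names are assumptions.

```cpp
// Simplified predicate pushdown: Filter(Join(L, R)) -> Join(Filter(L), R)
// when the predicate only references columns produced by L.
#include <memory>
#include <set>
#include <string>

struct Plan {
    std::string op;                     // "Scan", "Filter", "Join", ...
    std::set<std::string> predCols;     // columns referenced by a Filter
    std::set<std::string> outCols;      // columns this operator produces
    std::shared_ptr<Plan> left, right;  // children (Filter uses only left)
};

static bool subsetOf(const std::set<std::string>& a,
                     const std::set<std::string>& b) {
    for (auto& c : a) if (!b.count(c)) return false;
    return true;
}

std::shared_ptr<Plan> pushDown(std::shared_ptr<Plan> n) {
    if (n->op == "Filter" && n->left && n->left->op == "Join" &&
        subsetOf(n->predCols, n->left->left->outCols)) {
        auto join = n->left;
        n->left = join->left;  // Filter now wraps the join's left input
        join->left = n;
        return join;           // the Join becomes the new subtree root
    }
    return n;                  // symmetric right-side case omitted for brevity
}
```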
MIR (physical layer):
- Represents the physical execution plan
- Operator types: HashJoin, NestedLoopJoin, HashAggregate, Scan, Filter, Project, FilterProject
- Optimization passes:
  - Join reordering: heuristic-based join order selection (up to 4 tables)
  - Operator fusion: combines adjacent operators (e.g., fusing Filter and Project into FilterProject)
  - Common subexpression elimination (CSE): computes each repeated expression only once (sketched after this list)
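A minimal sketch of one way to implement CSE, via structural hashing: identical expression subtrees are canonicalized to a string key and deduplicated into a single shared node. Names and representation are illustrative, not the project's actual optimizer.cpp.

```cpp
// CSE by structural hashing: identical subtrees collapse to one shared node.
#include <map>
#include <memory>
#include <string>

struct Expr {
    std::string op;              // "+", "*", "col", "const", ...
    std::string leaf;            // column name or literal, for leaves only
    std::shared_ptr<Expr> lhs, rhs;  // both null for leaves, both set otherwise
};

// Canonical structural key for an expression tree.
std::string key(const Expr& e) {
    if (!e.lhs) return e.op + ":" + e.leaf;
    return "(" + key(*e.lhs) + e.op + key(*e.rhs) + ")";
}

std::shared_ptr<Expr> cse(std::shared_ptr<Expr> e,
                          std::map<std::string, std::shared_ptr<Expr>>& seen) {
    if (!e) return e;
    e->lhs = cse(e->lhs, seen);
    e->rhs = cse(e->rhs, seen);
    auto k = key(*e);
    auto it = seen.find(k);
    if (it != seen.end()) return it->second;  // reuse the earlier node
    seen[k] = e;
    return e;
}
```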
LIR (codegen layer):
- Uses LLVM for code generation and JIT compilation
- Generates optimized native code for each operator
- Implements efficient functions for hash joins, filters, and scans
- Register allocation and reuse for expression evaluation (see the LLVM sketch below)
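For a feel of what this layer emits, the standalone sketch below uses LLVM's IRBuilder to construct a function equivalent to `status == 42`, the shape of code generated for a simple equality predicate. It is a minimal example against the public LLVM C++ API, not an excerpt from codegen.cpp.

```cpp
// Build and print: i1 pred(i64 status) { return status == 42; }
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include "llvm/Support/raw_ostream.h"
#include <memory>

using namespace llvm;

int main() {
    LLVMContext ctx;
    auto mod = std::make_unique<Module>("predicate", ctx);
    IRBuilder<> b(ctx);

    auto* fnTy = FunctionType::get(b.getInt1Ty(), {b.getInt64Ty()}, false);
    auto* fn = Function::Create(fnTy, Function::ExternalLinkage, "pred", mod.get());
    auto* entry = BasicBlock::Create(ctx, "entry", fn);
    b.SetInsertPoint(entry);

    Value* status = fn->getArg(0);
    Value* cmp = b.CreateICmpEQ(status, b.getInt64(42), "cmp");
    b.CreateRet(cmp);

    verifyFunction(*fn);          // sanity-check the generated IR
    mod->print(outs(), nullptr);  // dump textual LLVM IR
    return 0;
}
```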
The system supports various expression types (modeled in the sketch after this list):
- Column references
- Constants (integer, float, string, boolean)
- Binary operations (arithmetic, comparison, logical)
- Function calls (partially implemented)
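One plausible way to model these forms in C++ is a tagged variant, sketched below; the names are assumptions for illustration, and the project's actual definitions in ir.h will differ.

```cpp
// Expression forms as a tagged variant: column refs, constants,
// binary operations, and function calls.
#include <cstdint>
#include <memory>
#include <string>
#include <variant>
#include <vector>

struct Expr;
using ExprPtr = std::shared_ptr<Expr>;

struct ColumnRef { std::string table, column; };
using Constant = std::variant<int64_t, double, std::string, bool>;
struct BinaryOp { std::string op; ExprPtr lhs, rhs; };       // +, <, AND, ...
struct FuncCall { std::string name; std::vector<ExprPtr> args; };

struct Expr {
    std::variant<ColumnRef, Constant, BinaryOp, FuncCall> node;
};
```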
- LLVM ≥15
- SQLite3
- CMake ≥3.20
- C++17 compatible compiler
For consistent development, we recommend using the provided virtual machine environment:

- Install Vagrant (version 2.3.4) and VirtualBox (version 7.0.6)

- Create and set up the virtual machine:

  ```
  mkdir DBIROptimizer_VM
  cd DBIROptimizer_VM
  vagrant box add lechenyu/cs6241
  vagrant init lechenyu/cs6241
  ```

  For Windows users, you may need to add this to your Vagrantfile to avoid errors:

  ```
  config.vm.provider "virtualbox" do |v|
    v.customize [ "modifyvm", :id, "--uartmode1", "disconnected" ]
  end
  ```

- Launch and connect to the VM:

  ```
  vagrant up
  vagrant ssh
  ```

- Inside the VM, clone the repository:

  ```
  git clone https://github.com/hgao327/CS6423-project.git
  cd CS6423-project
  ```

We recommend using VS Code with remote development capabilities:

- Configure SSH for the virtual machine:

  ```
  # For Linux/Mac
  cd DBIROptimizer_VM
  vagrant up
  vagrant ssh-config >> ~/.ssh/config
  ```

  ```
  # For Windows PowerShell
  cd DBIROptimizer_VM
  vagrant up
  vagrant ssh-config | Out-File C:\Users\<USER_NAME>\.ssh\config -Append
  ```

- Install VS Code extensions:
  - Remote-SSH
  - C/C++
  - CMake Tools

- Connect to the VM using Remote-SSH and open the project folder
```
# Inside the repository directory
mkdir build
cmake -DCMAKE_INSTALL_PREFIX=./install -B build -S . -G Ninja
cd build && ninja install
```

This will build the project and install:
- The DBIROptimizer binary
- All necessary libraries
- Test scripts and benchmarks

```
# Run a SQL query with DBIROptimizer
./install/bin/dbir /path/to/database.db "SELECT * FROM orders WHERE status = 'Shipped'"

# Advanced benchmark with multiple queries
./install/bin/dbir /path/to/database.db --benchmark
```

For easier interaction, use the provided Python interface:
```
# Create a test database
python dbir_interface.py --db test.db --create-test-db

# Run a simple query
python dbir_interface.py --db test.db --query "SELECT * FROM orders WHERE status = 'Shipped'"

# Benchmark a query
python dbir_interface.py --db test.db --query "SELECT o.order_id, c.name FROM orders o JOIN customers c ON o.customer_id = c.id WHERE c.city = 'NY'" --benchmark --runs 5

# Run TPC-H benchmark queries
python dbir_interface.py --db tpch.db --tpch-queries benchmark/tpch_queries.json --benchmark
```

To generate a synthetic test database:

```
python dbir_interface.py --db synth.db --create-test-db
```

This creates a sample database with:
- customers table (100 records)
- orders table (1000 records)
- Relationship between customers and orders
For TPC-H benchmarks, you'll need to set up the TPC-H schema and load data:

```
# Assuming you have the TPC-H tools installed
cd /path/to/tpch-tools
dbgen -s 1  # Scale factor 1
# Use the generated .tbl files to create your SQLite database
```

A complete end-to-end walkthrough:

```
# 1. Build the project
mkdir build && cd build
cmake -G Ninja -DCMAKE_INSTALL_PREFIX=../install ..
ninja install
cd ..

# 2. Create and initialize a test database
python dbir_interface.py --db test.db --create-test-db

# 3. Run a simple query
./install/bin/dbir test.db "SELECT * FROM orders WHERE status = 'Shipped'"

# 4. Benchmark against SQLite
python dbir_interface.py --db test.db --query "SELECT o.order_id, c.name FROM orders o JOIN customers c ON o.customer_id = c.id WHERE c.city = 'NY'" --benchmark
```

To reproduce the benchmarks from the report:

```
# Synthetic benchmark
./install/bin/run_bench benchmark/synthetic/workload.yaml

# TPC-H benchmark (Scale Factor 5)
./install/bin/run_bench benchmark/tpch_subset/tpch_sf5.yaml
```

Each script outputs CSV with sqlite, unopt, and opt columns for comparison.
Benchmarks were run on an Intel i7-8650U CPU with 16 GB RAM under Ubuntu 22.04. Synthetic workload results:
| Query Type | SQLite (ms) | Unopt (ms) | Opt (ms) | Speedup |
|---|---|---|---|---|
| Filter-only | 430 | 360 | 310 | 1.39x |
| Equality Join | 1500 | 1350 | 1100 | 1.36x |
| Join + Filter | 1680 | 1410 | 1200 | 1.40x |
| Group+Filter | 1150 | 960 | 800 | 1.44x |

TPC-H results (Scale Factor 5):

| TPC-H Query | SQLite (s) | Unopt (s) | Opt (s) | Speedup |
|---|---|---|---|---|
| Q1 | 1.82 | 1.55 | 1.34 | 1.36x |
| Q3 | 2.71 | 2.30 | 2.02 | 1.34x |
| Q6 | 0.86 | 0.75 | 0.65 | 1.32x |
| Q10 | 3.10 | 2.64 | 2.30 | 1.35x |
| Q14 | 2.55 | 2.16 | 1.85 | 1.38x |
JIT compilation typically adds 30-60 ms overhead per query, which is included in the times above.
- SQL Parsing: SQL query is parsed into an abstract syntax tree (AST)
- HIR Generation: AST is transformed into HIR nodes representing logical operators
- HIR Optimization: Optimizations like predicate pushdown and constant folding are applied
- MIR Generation: HIR is converted to MIR nodes representing physical operators
- MIR Optimization: Join reordering, operator fusion, and redundancy elimination
- Code Generation: MIR is translated to LLVM IR
- JIT Compilation: LLVM IR is compiled to native code
- Execution: Compiled code is executed with SQLite as the storage layer
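Condensed into a runnable skeleton, the pipeline looks roughly like this; every stage here is an illustrative stub, and the real driver lives in dbir.cpp.

```cpp
// Stubbed end-to-end pipeline: parse -> HIR -> MIR -> codegen -> execute.
#include <iostream>
#include <string>

struct Hir {}; struct Mir {}; struct Compiled {};

Hir      parseAndLower(const std::string& sql) { std::cout << "parse: " << sql << "\n"; return {}; }
void     optimizeHir(Hir&)  { std::cout << "HIR: pushdown, constant folding\n"; }
Mir      toMir(const Hir&)  { return {}; }
void     optimizeMir(Mir&)  { std::cout << "MIR: reorder, fuse, CSE\n"; }
Compiled codegenAndJit(const Mir&) { std::cout << "LIR: LLVM IR -> native\n"; return {}; }
void     execute(const Compiled&)  { std::cout << "execute over SQLite storage\n"; }

int main() {
    Hir hir = parseAndLower("SELECT * FROM orders WHERE status = 'Shipped'");
    optimizeHir(hir);
    Mir mir = toMir(hir);
    optimizeMir(mir);
    execute(codegenAndJit(mir));
}
```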
Expressions are evaluated through a recursive visitor pattern:
- Constants are folded at compile time
- Column references are translated to tuple attribute accesses
- Binary operations are translated to corresponding LLVM instructions
- Common subexpressions are detected and computed only once
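The sketch below shows the constant-folding case of such a visitor: an addition over two constant operands collapses into a single constant node before any code is emitted. Node layout and names are assumptions for illustration; the real visitor also emits LLVM IR for the non-constant cases.

```cpp
// Constant folding in a recursive expression visitor.
#include <memory>

struct Node {
    enum Kind { Const, Column, Add } kind;
    long value = 0;                  // valid when kind == Const
    std::shared_ptr<Node> lhs, rhs;  // valid when kind == Add
};

std::shared_ptr<Node> fold(std::shared_ptr<Node> n) {
    if (!n || n->kind != Node::Add) return n;   // leaves are already folded
    n->lhs = fold(n->lhs);
    n->rhs = fold(n->rhs);
    if (n->lhs->kind == Node::Const && n->rhs->kind == Node::Const) {
        auto c = std::make_shared<Node>();
        c->kind = Node::Const;
        c->value = n->lhs->value + n->rhs->value;  // evaluated now, not at runtime
        return c;
    }
    return n;
}
```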
DBIROptimizer uses SQLite for:
- Table storage and basic schema information
- Row scanning operations (data is read from SQLite, then processed by compiled code)
- Query result storage
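In miniature, the storage boundary looks like the sketch below: rows are pulled through SQLite's C API and each one is handed to a compiled predicate (a plain function pointer standing in for the JIT-compiled code). The table schema and the integer `status_code` column are hypothetical, chosen to keep the example small.

```cpp
// Scan rows out of SQLite and run a compiled predicate over each one.
#include <sqlite3.h>
#include <cstdio>

using Pred = bool (*)(sqlite3_int64 status_code);

void scan(const char* path, Pred pred) {
    sqlite3* db = nullptr;
    if (sqlite3_open(path, &db) != SQLITE_OK) return;

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "SELECT order_id, status_code FROM orders",
                       -1, &stmt, nullptr);
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        sqlite3_int64 id     = sqlite3_column_int64(stmt, 0);
        sqlite3_int64 status = sqlite3_column_int64(stmt, 1);
        if (pred(status))                      // compiled filter runs here
            std::printf("%lld\n", (long long)id);
    }
    sqlite3_finalize(stmt);
    sqlite3_close(db);
}
```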
Vagrant syncs the /vagrant folder on the virtual machine with the folder containing your Vagrantfile on the host machine. You can use this synced folder to transfer files between the VM and your host system.
- Join Optimization: Currently limited to heuristic join reordering for up to 4 tables. A full dynamic programming approach is planned.
- Execution Model: Uses tuple-at-a-time execution. Future work includes vectorized execution and SIMD instructions.
- Parallelism: Single-threaded execution only. Parallelization is a key future enhancement.
- Dataflow Analysis: Basic forward dataflow analysis. More advanced SSA-based approaches are planned.
- User-Defined Functions: Limited support. Full integration with LLVM-based code generation is planned.
- Adaptive Execution: Runtime statistics collection and re-optimization are planned.
- Memory Management: Basic memory management. More sophisticated buffer management is planned.