A high-performance CLI tool for archiving database data to S3-compatible object storage.
Currently supports: PostgreSQL input (partitioned tables, plus non-partitioned tables when --date-column, --start-date, and --end-date are provided) and S3-compatible object storage output.
- Parallel Processing - Archive multiple partitions concurrently with configurable workers
- Beautiful Progress UI - Real-time progress tracking with dual progress bars
- Embedded Cache Viewer - Beautiful web interface with real-time updates:
- WebSocket Live Updates - Real-time data streaming without polling
- Interactive task monitoring showing current partition and operation
- Clickable partition names to jump directly to table row
- Shows archiver status (running/idle) with PID tracking
- Live statistics: total partitions, sizes, compression ratios
- Sortable table with S3 upload status indicators
- Smooth animations highlight data changes
- Error tracking with timestamps
- Auto-reconnecting WebSocket for reliability
- Intelligent Caching - Advanced caching system for maximum efficiency:
- Caches row counts for 24 hours (refreshed daily)
- Caches file metadata permanently (size, MD5, compression ratio)
- Tracks errors with timestamps
- Skip extraction/compression entirely when cached metadata matches S3
- Data Integrity - Comprehensive file integrity verification:
- Size comparison (both compressed and uncompressed)
- MD5 hash verification for single-part uploads
- Multipart ETag verification for large files (>100MB)
- Automatic multipart upload for files >100MB
- Smart Compression - Uses Zstandard compression with multi-core support
- Intelligent Resume - Three-level skip detection:
- Fast skip using cached metadata (no extraction needed)
- Skip if S3 file matches after local processing
- Re-upload if size or hash differs
- Flexible Partition Support - Handles multiple partition naming formats:
  - table_YYYYMMDD (e.g., messages_20240315)
  - table_pYYYYMMDD (e.g., messages_p20240315)
  - table_YYYY_MM (e.g., messages_2024_03)
- Go 1.22 or higher
- PostgreSQL database with partitioned tables (format: tablename_YYYYMMDD) or non-partitioned tables when you supply --date-column, --start-date, and --end-date so the archiver can build synthetic windows
- S3-compatible object storage (Hetzner, AWS S3, MinIO, etc.)
The easiest way to install on macOS or Linux:
brew install airframesio/tap/data-archiver

Download the latest release for your platform from the releases page.

go install github.com/airframesio/data-archiver@latest

git clone https://github.com/airframesio/data-archiver.git
cd data-archiver
go build -o data-archiver

Basic example:

data-archiver \
--db-user myuser \
--db-password mypass \
--db-name mydb \
--table flights \
--s3-endpoint https://fsn1.your-objectstorage.com \
--s3-bucket my-archive-bucket \
--s3-access-key YOUR_ACCESS_KEY \
--s3-secret-key YOUR_SECRET_KEY \
--path-template "archives/{table}/{YYYY}/{MM}" \
--start-date 2024-01-01 \
--end-date 2024-01-31

Advanced Example with Custom Output:
data-archiver \
--db-user myuser \
--db-password mypass \
--db-name mydb \
--table flights \
--s3-endpoint https://fsn1.your-objectstorage.com \
--s3-bucket my-archive-bucket \
--s3-access-key YOUR_ACCESS_KEY \
--s3-secret-key YOUR_SECRET_KEY \
--path-template "data/{table}/year={YYYY}/month={MM}" \
--output-format parquet \
--compression lz4 \
--compression-level 5 \
--output-duration daily \
--start-date 2024-01-01 \
--end-date 2024-01-31

# Archive data to S3
data-archiver archive [flags]
# Dump database using pg_dump to S3
data-archiver dump [flags]
# Dump schema once, then emit date-sliced data dumps via pg_dump
data-archiver dump-hybrid [flags]
# Restore data from S3
data-archiver restore [flags]

Data Archiver
A CLI tool to efficiently archive database data to object storage.
Supports multiple output formats (JSONL/CSV/Parquet), compression types (Zstandard/LZ4/Gzip),
and flexible path templates for S3-compatible storage.
Usage:
data-archiver [flags]
Flags:
--viewer start embedded cache viewer web server
--compression string compression type: zstd, lz4, gzip, none (default "zstd")
--compression-level int compression level (zstd: 1-22, lz4/gzip: 1-9, none: 0) (default 3)
--config string config file (default is $HOME/.data-archiver.yaml)
--date-column string timestamp column name for duration-based splitting (optional)
--db-host string PostgreSQL host (default "localhost")
--db-name string PostgreSQL database name
--db-password string PostgreSQL password
--db-port int PostgreSQL port (default 5432)
--db-sslmode string PostgreSQL SSL mode (disable, require, verify-ca, verify-full) (default "disable")
--db-user string PostgreSQL user
-d, --debug enable debug output
--dry-run perform a dry run without uploading
--end-date string end date (YYYY-MM-DD) (default "2025-08-27")
-h, --help help for data-archiver
--output-duration string output file duration: hourly, daily, weekly, monthly, yearly (default "daily")
--output-format string output format: jsonl, csv, parquet (default "jsonl")
--path-template string S3 path template with placeholders: {table}, {YYYY}, {MM}, {DD}, {HH} (required)
--s3-access-key string S3 access key
--s3-bucket string S3 bucket name
--s3-endpoint string S3-compatible endpoint URL
--s3-region string S3 region (default "auto")
--s3-secret-key string S3 secret key
--skip-count skip counting rows (faster startup, no progress bars)
--start-date string start date (YYYY-MM-DD)
--table string base table name (required)
--viewer-port int port for cache viewer web server (default 8080)
--workers int number of parallel workers (default 4)
- --table - Base table name (without date suffix)
- --path-template - S3 path template with placeholders (e.g., "archives/{table}/{YYYY}/{MM}")
- --db-user - PostgreSQL username
- --db-name - PostgreSQL database name
- --s3-endpoint - S3-compatible endpoint URL
- --s3-bucket - S3 bucket name
- --s3-access-key - S3 access key
- --s3-secret-key - S3 secret key

- --output-format - Output file format: jsonl (default), csv, or parquet
- --compression - Compression type: zstd (default), lz4, gzip, or none
- --compression-level - Compression level (default: 3)
  - Zstandard: 1-22 (higher = better compression, slower)
  - LZ4/Gzip: 1-9 (higher = better compression, slower)
- --output-duration - File duration: hourly, daily (default), weekly, monthly, or yearly
- --date-column - Timestamp column for duration-based splitting. Required when archiving non-partitioned tables so the archiver can build synthetic windows.
- --chunk-size - Number of rows to process per chunk (default: 10000, range: 100-1000000)
  - Tune based on average row size for optimal memory usage
  - Smaller chunks for large rows, larger chunks for small rows
The archive command now supports base tables that aren't physically partitioned. Provide:
- --date-column so rows can be filtered by time
- --start-date and --end-date to define the overall range
- --output-duration to control the slice size (daily, weekly, etc.)
When no partitions are discovered, the archiver automatically slices the base table into synthetic windows covering the requested range and streams each window through the normal extraction/compression/upload pipeline.
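To picture how those synthetic windows are built, here is a minimal sketch (the type and function names are illustrative, not the archiver's actual code, and a fixed 24-hour step stands in for the calendar-aware steps that weekly/monthly durations would need):

```go
package main

import (
	"fmt"
	"time"
)

// window is an illustrative [start, end) time slice covering part of the
// requested date range.
type window struct {
	start, end time.Time
}

// sliceRange splits [start, end] into synthetic windows of the given step,
// mirroring how a non-partitioned table could be walked when --date-column,
// --start-date, and --end-date are supplied.
func sliceRange(start, end time.Time, step time.Duration) []window {
	var out []window
	for cur := start; !cur.After(end); cur = cur.Add(step) {
		out = append(out, window{start: cur, end: cur.Add(step)})
	}
	return out
}

func main() {
	start, _ := time.Parse("2006-01-02", "2024-01-01")
	end, _ := time.Parse("2006-01-02", "2024-01-03")
	for _, w := range sliceRange(start, end, 24*time.Hour) {
		// Each window becomes one filter on the --date-column, e.g.
		// created_at >= '2024-01-01' AND created_at < '2024-01-02'.
		fmt.Printf("%s -> %s\n", w.start.Format("2006-01-02"), w.end.Format("2006-01-02"))
	}
}
```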
Use data-archiver dump-hybrid when you need a schema dump plus partitioned data files generated directly by pg_dump.
- Step 1: Dump the parent table schema once (partitions are automatically excluded).
- Step 2: Discover partitions that match the provided date range and upload grouped dumps using --path-template + --output-duration.
- Any dates not covered by physical partitions are still exported by filtering the parent table via temporary staging tables, so mixed partition/non-partition layouts are supported.
- Ideal for storing schema metadata next to date-windowed pg_dump archives without manual SQL.
- Requires --date-column so non-partitioned tables can be filtered via pg_dump --where.
Example: dump the events table schema plus daily data files for January 2024.
data-archiver dump-hybrid \
--db-host localhost \
--db-port 5432 \
--db-user myuser \
--db-password mypass \
--db-name analytics \
--table events \
--date-column created_at \
--start-date 2024-01-01 \
--end-date 2024-01-31 \
--s3-endpoint https://s3.example.com \
--s3-bucket pg-dumps \
--s3-access-key YOUR_KEY \
--s3-secret-key YOUR_SECRET \
--path-template "pgdumps/{table}/{YYYY}/{MM}" \
--output-duration daily

The schema is uploaded once (e.g., pgdumps/events/schema.dump), then pg_dump writes compressed custom-format data for each daily/weekly/monthly grouping, such as pgdumps/events/2024/01/events-2024-01-05.dump.
The tool supports three configuration methods (in order of precedence):
- Command-line flags (highest priority)
- Environment variables (prefix: ARCHIVE_)
- Configuration file (lowest priority)
export ARCHIVE_DB_HOST=localhost
export ARCHIVE_DB_PORT=5432
export ARCHIVE_DB_USER=myuser
export ARCHIVE_DB_PASSWORD=mypass
export ARCHIVE_DB_NAME=mydb
export ARCHIVE_DB_SSLMODE=disable
export ARCHIVE_S3_ENDPOINT=https://fsn1.your-objectstorage.com
export ARCHIVE_S3_BUCKET=my-bucket
export ARCHIVE_S3_ACCESS_KEY=your_key
export ARCHIVE_S3_SECRET_KEY=your_secret
export ARCHIVE_S3_PATH_TEMPLATE="archives/{table}/{YYYY}/{MM}"
export ARCHIVE_TABLE=flights
export ARCHIVE_OUTPUT_FORMAT=jsonl # Options: jsonl, csv, parquet
export ARCHIVE_COMPRESSION=zstd # Options: zstd, lz4, gzip, none
export ARCHIVE_COMPRESSION_LEVEL=3 # zstd: 1-22, lz4/gzip: 1-9
export ARCHIVE_OUTPUT_DURATION=daily # Options: hourly, daily, weekly, monthly, yearly
export ARCHIVE_WORKERS=8
export ARCHIVE_CACHE_VIEWER=true
export ARCHIVE_VIEWER_PORT=8080

Create ~/.data-archiver.yaml:
db:
  host: localhost
  port: 5432
  user: myuser
  password: mypass
  name: mydb
  sslmode: disable # Options: disable, require, verify-ca, verify-full
s3:
  endpoint: https://fsn1.your-objectstorage.com
  bucket: my-archive-bucket
  access_key: your_access_key
  secret_key: your_secret_key
  region: auto
path_template: "archives/{table}/{YYYY}/{MM}" # S3 path template with placeholders
table: flights
output_format: jsonl # Options: jsonl, csv, parquet
compression: zstd # Options: zstd, lz4, gzip, none
compression_level: 3 # zstd: 1-22, lz4/gzip: 1-9
output_duration: daily # Options: hourly, daily, weekly, monthly, yearly
workers: 8
start_date: "2024-01-01"
end_date: "2024-12-31"
cache_viewer: false # Enable embedded cache viewer
viewer_port: 8080 # Port for cache viewer web server

Files are organized in S3 based on your configured --path-template. The tool supports flexible path templates with the following placeholders:

- {table} - Table name
- {YYYY} - 4-digit year
- {MM} - 2-digit month
- {DD} - 2-digit day
- {HH} - 2-digit hour (for hourly duration)
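A quick sketch of how these placeholders expand (illustrative only, not the tool's internal template code):

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// expandTemplate substitutes the documented placeholders into a path
// template for a given table and window start time.
func expandTemplate(tmpl, table string, t time.Time) string {
	r := strings.NewReplacer(
		"{table}", table,
		"{YYYY}", t.Format("2006"),
		"{MM}", t.Format("01"),
		"{DD}", t.Format("02"),
		"{HH}", t.Format("15"),
	)
	return r.Replace(tmpl)
}

func main() {
	t, _ := time.Parse("2006-01-02", "2024-01-05")
	fmt.Println(expandTemplate("archives/{table}/{YYYY}/{MM}", "flights", t))
	// Output: archives/flights/2024/01
}
```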
Example with default settings (--path-template "archives/{table}/{YYYY}/{MM}" --output-format jsonl --compression zstd --output-duration daily):
bucket/
└── archives/
    └── flights/
        └── 2024/
            └── 01/
                ├── flights-2024-01-01.jsonl.zst
                ├── flights-2024-01-02.jsonl.zst
                └── flights-2024-01-03.jsonl.zst
Example with Parquet and LZ4 (--path-template "data/{table}/year={YYYY}/month={MM}" --output-format parquet --compression lz4):
bucket/
└── data/
    └── flights/
        └── year=2024/
            └── month=01/
                ├── flights-2024-01-01.parquet.lz4
                ├── flights-2024-01-02.parquet.lz4
                └── flights-2024-01-03.parquet.lz4
Example with uncompressed CSV (--path-template "{table}/{YYYY}" --output-format csv --compression none --output-duration monthly):
bucket/
└── flights/
    └── 2024/
        ├── flights-2024-01.csv
        ├── flights-2024-02.csv
        └── flights-2024-03.csv
The archiver includes an embedded web server for monitoring cache and progress:
# Start archiver with embedded cache viewer
data-archiver --viewer --viewer-port 8080 [other options]
# Or run standalone cache viewer
data-archiver viewer --port 8080

Features:
- WebSocket Real-time Updates: Live data streaming with automatic reconnection
- Interactive Status Panel:
- Shows current partition being processed with clickable link
- Displays specific operation (e.g., "Checking if exists", "Extracting", "Compressing", "Uploading")
- Progress bar with completion percentage and partition count
- Elapsed time tracking
- Visual Change Detection: Smooth animations highlight updated cells and stats
- S3 Upload Status: Shows which files are uploaded vs only processed locally
- Comprehensive Metrics: Shows both compressed and uncompressed sizes
- Compression Ratios: Visual display of space savings
- Error Tracking: Displays last error and timestamp for failed partitions
- Smart Rendering: No page flashing - only updates changed values
- Sortable Columns: Click any column header to sort (default: partition name)
- File Counts: Shows total partitions, processed, uploaded, and errors
- Process Monitoring: Checks if archiver is currently running via PID
- Connection Status: Visual indicator shows WebSocket connection state
Access the viewer at http://localhost:8080 (or your configured port).
The cache viewer uses modern web technologies for optimal performance:
- WebSocket Protocol: Bi-directional communication for instant updates
- Automatic Reconnection: Reconnects every 2 seconds if connection drops
- Event-Driven File Monitoring: Uses fsnotify for instant file change detection
- Efficient Updates: Only transmits and renders changed data
- No Polling Overhead: WebSocket eliminates the need for HTTP polling
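For scripted monitoring outside the browser, the same /ws endpoint (listed with the API endpoints below) can be consumed directly. A minimal sketch using gorilla/websocket with the same reconnect-every-2-seconds behaviour (the message payloads are whatever JSON the viewer streams; this client just logs them):

```go
package main

import (
	"log"
	"time"

	"github.com/gorilla/websocket"
)

// watch connects to the viewer's /ws endpoint and logs every message,
// reconnecting after a short delay whenever the connection drops.
func watch(url string) {
	for {
		conn, _, err := websocket.DefaultDialer.Dial(url, nil)
		if err != nil {
			log.Printf("dial failed: %v, retrying in 2s", err)
			time.Sleep(2 * time.Second)
			continue
		}
		for {
			_, msg, err := conn.ReadMessage()
			if err != nil {
				log.Printf("connection lost: %v", err)
				break
			}
			log.Printf("update: %s", msg) // JSON cache/status payload
		}
		conn.Close()
		time.Sleep(2 * time.Second)
	}
}

func main() {
	watch("ws://localhost:8080/ws")
}
```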
The tool features a beautiful terminal UI with:
- Per-partition progress bar: Shows real-time progress for data extraction, compression, and upload
- Overall progress bar: Tracks completion across all partitions
- Live statistics: Displays elapsed time, estimated remaining time, and recent completions
- Row counter: Shows progress through large tables during extraction
The tool automatically discovers partitions matching these naming patterns:

- Daily partitions (standard): {base_table}_YYYYMMDD
  - Example: flights_20240101, flights_20240102
- Daily partitions (with prefix): {base_table}_pYYYYMMDD
  - Example: flights_p20240101, flights_p20240102
- Monthly partitions: {base_table}_YYYY_MM
  - Example: flights_2024_01, flights_2024_02
  - Note: Monthly partitions are processed as the first day of the month

For example, if your base table is flights, the tool will find and process all of these:

- flights_20240101 (daily)
- flights_p20240102 (daily with prefix)
- flights_2024_01 (monthly)
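A rough sketch of how such names can be recognized and mapped to dates (illustrative only, not the archiver's exact matching code):

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

var (
	dailyRe   = regexp.MustCompile(`^(.+)_p?(\d{8})$`)       // table_YYYYMMDD or table_pYYYYMMDD
	monthlyRe = regexp.MustCompile(`^(.+)_(\d{4})_(\d{2})$`) // table_YYYY_MM
)

// partitionDate extracts the date a partition covers; monthly partitions
// map to the first day of the month, as noted above.
func partitionDate(name string) (time.Time, bool) {
	if m := monthlyRe.FindStringSubmatch(name); m != nil {
		t, err := time.Parse("2006_01", m[2]+"_"+m[3])
		return t, err == nil
	}
	if m := dailyRe.FindStringSubmatch(name); m != nil {
		t, err := time.Parse("20060102", m[2])
		return t, err == nil
	}
	return time.Time{}, false
}

func main() {
	for _, p := range []string{"flights_20240101", "flights_p20240102", "flights_2024_01"} {
		if d, ok := partitionDate(p); ok {
			fmt.Println(p, "->", d.Format("2006-01-02"))
		}
	}
}
```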
Each row from the partition is exported as a single JSON object on its own line:
{"id":1,"flight_number":"AA123","departure":"2024-01-01T10:00:00Z"}
{"id":2,"flight_number":"UA456","departure":"2024-01-01T11:00:00Z"}

Uses Facebook's Zstandard compression with:
- Multi-core parallel compression
- "Better Compression" preset for optimal size/speed balance
- Typically achieves 5-10x compression ratios on JSON data
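In terms of the klauspost/compress library (listed under Acknowledgements), the setup is roughly equivalent to the sketch below; the preset and concurrency shown are illustrative, and in the real tool the level is driven by --compression-level:

```go
package main

import (
	"io"
	"log"
	"os"
	"runtime"

	"github.com/klauspost/compress/zstd"
)

func main() {
	in, err := os.Open("partition.jsonl")
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	out, err := os.Create("partition.jsonl.zst")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// Multi-core zstd writer: one worker per CPU core, using the
	// "better compression" preset mentioned above.
	enc, err := zstd.NewWriter(out,
		zstd.WithEncoderLevel(zstd.SpeedBetterCompression),
		zstd.WithEncoderConcurrency(runtime.NumCPU()),
	)
	if err != nil {
		log.Fatal(err)
	}

	if _, err := io.Copy(enc, in); err != nil {
		log.Fatal(err)
	}
	if err := enc.Close(); err != nil { // flushes and finalizes the frame
		log.Fatal(err)
	}
}
```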
Files are skipped if:
- They already exist in S3 with the same path
- The file size matches (prevents re-uploading identical data)
Enable debug mode for detailed output:
data-archiver --debug --table flights ...

Debug mode shows:
- Database connection details
- Discovered partitions and row counts
- Extraction progress (every 10,000 rows)
- Compression ratios
- Upload destinations
- Detailed error messages
Test your configuration without uploading:
data-archiver --dry-run --table flights ...

This will:
- Connect to the database
- Discover partitions
- Extract and compress data
- Calculate file sizes and MD5 hashes
- Skip the actual upload
The archiver uses an intelligent two-tier caching system to maximize performance:
- Caches partition row counts for 24 hours
- Speeds up progress bar initialization
- Always recounts today's partition for accuracy
- Cache files live under ~/.data-archiver/cache/ and are namespaced by the subcommand plus the absolute S3 destination (for example, archive_events_a1b2c3d4_metadata.json)
- Caches compressed/uncompressed sizes, MD5 hash, and S3 upload status
- Tracks whether files have been successfully uploaded to S3
- Enables fast skipping without extraction/compression on subsequent runs
- Validates against S3 metadata before skipping
- Preserves all metadata when updating row counts
- Stores error messages with timestamps for failed uploads
- File metadata is kept permanently (only row counts expire after 24 hours)
- Applies to archive, dump (when using --start-date/--end-date and --output-duration), and the date-windowed dump-hybrid step, so reruns skip windows that already exist in S3
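To make the cached record concrete, it can be pictured roughly like this (a sketch only; the field names and JSON keys are illustrative, not the tool's exact on-disk schema):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// partitionMetadata sketches the kind of record the cache keeps per
// partition or window; the real JSON layout under ~/.data-archiver/cache/
// may differ.
type partitionMetadata struct {
	Partition        string    `json:"partition"`
	RowCount         int64     `json:"row_count"`
	RowCountUpdated  time.Time `json:"row_count_updated"` // row counts expire after 24h
	UncompressedSize int64     `json:"uncompressed_size"`
	CompressedSize   int64     `json:"compressed_size"`
	MD5              string    `json:"md5"`
	Uploaded         bool      `json:"uploaded"`
	LastError        string    `json:"last_error"`
	LastErrorAt      time.Time `json:"last_error_at"`
}

func main() {
	m := partitionMetadata{
		Partition:        "flights_20240101",
		RowCount:         1234567,
		RowCountUpdated:  time.Now(),
		UncompressedSize: 512 << 20,
		CompressedSize:   64 << 20,
		MD5:              "0123456789abcdef0123456789abcdef",
		Uploaded:         true,
	}
	out, _ := json.MarshalIndent(m, "", "  ")
	fmt.Println(string(out))
}
```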
On subsequent runs with cached metadata:
- Check cached size/MD5 against S3 (milliseconds)
- Skip extraction and compression if match found
- Result: 100-1000x faster for already-processed partitions
The archiver provides real-time monitoring capabilities:
- Creates PID file at ~/.data-archiver/archiver.pid when running
- Allows external tools to check if archiver is active
- Automatically cleaned up on exit
- Writes current task details to ~/.data-archiver/current_task.json
- Includes:
- Current operation (connecting, counting, extracting, uploading)
- Progress percentage
- Total and completed partitions
- Start time and last update time
- Updated in real-time during processing
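For example, an external health check could read the PID file and probe the process. A minimal sketch (assuming the file holds just the numeric PID; Unix-only, since it relies on signal 0):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"syscall"
)

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Println("cannot determine home dir:", err)
		return
	}
	data, err := os.ReadFile(filepath.Join(home, ".data-archiver", "archiver.pid"))
	if err != nil {
		fmt.Println("archiver not running (no PID file)")
		return
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		fmt.Println("malformed PID file:", err)
		return
	}
	// Signal 0 checks process existence/permissions without delivering
	// anything to the process.
	proc, _ := os.FindProcess(pid)
	if err := proc.Signal(syscall.Signal(0)); err != nil {
		fmt.Printf("stale PID file (pid %d not running)\n", pid)
		return
	}
	fmt.Printf("archiver running with pid %d\n", pid)
}
```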
The cache viewer provides REST API and WebSocket endpoints:
- /api/cache - Returns all cached metadata (REST)
- /api/status - Returns archiver running status and current task (REST)
- /ws - WebSocket endpoint for real-time updates
  - Sends cache updates when files change
  - Streams status updates during archiving
  - Automatic reconnection support
The archiver ensures data integrity through multiple verification methods:
- Calculates MD5 hash of compressed data
- Compares with S3 ETag (which is MD5 for single-part uploads)
- Only skips if both size and MD5 match exactly
- Automatically uses multipart upload for large files
- Calculates multipart ETag using S3's algorithm
- Verifies size and multipart ETag match before skipping
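For reference, a multipart ETag is the MD5 of the concatenated per-part MD5 digests, suffixed with the part count. A small sketch of that calculation (the 100 MB part size here is an assumption and must match the part size actually used for the upload):

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"log"
	"os"
)

// multipartETag reproduces S3's multipart ETag: MD5 over the concatenation
// of each part's MD5 digest, followed by "-<number of parts>".
func multipartETag(path string, partSize int64) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	var partDigests []byte
	parts := 0
	for {
		h := md5.New()
		n, err := io.CopyN(h, f, partSize)
		if n > 0 {
			partDigests = append(partDigests, h.Sum(nil)...)
			parts++
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
	}
	sum := md5.Sum(partDigests)
	return fmt.Sprintf("%s-%d", hex.EncodeToString(sum[:]), parts), nil
}

func main() {
	etag, err := multipartETag("flights-2024-01-01.jsonl.zst", 100*1024*1024)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(etag)
}
```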
- First Run: Extract → Compress → Calculate MD5 → Upload → Cache metadata
- Subsequent Runs with Cache: Check cache → Compare with S3 → Skip if match
- Subsequent Runs without Cache: Extract → Compress → Calculate MD5 → Compare with S3 → Skip or upload
data-archiver \
--table events \
--start-date $(date -d '30 days ago' +%Y-%m-%d) \
--config ~/.archive-config.yaml

data-archiver \
--table transactions \
--start-date 2024-06-01 \
--end-date 2024-06-30 \
--debug \
--workers 8

data-archiver \
--config production.yaml \
--table orders \
--dry-run \
--debug

The dump subcommand uses PostgreSQL's pg_dump utility to create database dumps with custom format and heavy compression, streaming directly to S3.
data-archiver dump \
--db-user myuser \
--db-password mypass \
--db-name mydb \
--s3-endpoint https://fsn1.your-objectstorage.com \
--s3-bucket my-archive-bucket \
--s3-access-key YOUR_ACCESS_KEY \
--s3-secret-key YOUR_SECRET_KEY \
--path-template "dumps/{table}/{YYYY}/{MM}"

data-archiver dump \
--db-user myuser \
--db-password mypass \
--db-name mydb \
--dump-mode schema-only \
--s3-endpoint https://fsn1.your-objectstorage.com \
--s3-bucket my-archive-bucket \
--s3-access-key YOUR_ACCESS_KEY \
--s3-secret-key YOUR_SECRET_KEY \
--path-template "dumps/{table}/{YYYY}/{MM}"

data-archiver dump \
--db-user myuser \
--db-password mypass \
--db-name mydb \
--dump-mode data-only \
--workers 8 \
--s3-endpoint https://fsn1.your-objectstorage.com \
--s3-bucket my-archive-bucket \
--s3-access-key YOUR_ACCESS_KEY \
--s3-secret-key YOUR_SECRET_KEY \
--path-template "dumps/{table}/{YYYY}/{MM}"

data-archiver dump \
--db-user myuser \
--db-password mypass \
--db-name mydb \
--table flights \
--workers 4 \
--s3-endpoint https://fsn1.your-objectstorage.com \
--s3-bucket my-archive-bucket \
--s3-access-key YOUR_ACCESS_KEY \
--s3-secret-key YOUR_SECRET_KEY \
--path-template "dumps/{table}/{YYYY}/{MM}"

- --db-host - PostgreSQL host (default: localhost)
- --db-port - PostgreSQL port (default: 5432)
- --db-user - PostgreSQL user (required)
- --db-password - PostgreSQL password (required)
- --db-name - PostgreSQL database name (required)
- --db-sslmode - PostgreSQL SSL mode (disable, require, verify-ca, verify-full)
- --s3-endpoint - S3-compatible endpoint URL (required)
- --s3-bucket - S3 bucket name (required)
- --s3-access-key - S3 access key (required)
- --s3-secret-key - S3 secret key (required)
- --s3-region - S3 region (default: auto)
- --path-template - S3 path template with placeholders (required)
- --table - Table name to dump (optional, dumps entire database if not specified)
- --workers - Number of parallel jobs for pg_dump (default: 4)
- --dump-mode - Dump mode: schema-only, data-only, or schema-and-data (default: schema-and-data)
- Custom Format: Uses PostgreSQL's custom format (-Fc), which supports parallel dumps and compression
- Heavy Compression: Uses maximum compression level (-Z 9) for optimal file size
- Parallel Processing: Honors the --workers flag to run multiple parallel jobs
- Streaming Upload: Streams output directly to S3 without creating intermediate files (see the sketch after this list)
- Flexible Modes: Supports schema-only, data-only, or both schema and data
- Table-Specific: Can dump individual tables or entire databases
- Schema-Only Optimization: For schema-only dumps:
  - Automatically discovers and dumps only top-level tables (excludes partitions)
  - Partitions share the same schema as their parent table, so scanning them is unnecessary
  - Use the --table flag to dump a specific table's schema
  - Without --table, dumps schemas for all top-level tables
- Automatic Naming: Generates filenames with timestamp and mode suffix (e.g., flights-schema-20240115-120000.dump)
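Conceptually, the streaming upload pipes pg_dump's stdout straight into a multipart S3 upload. A simplified sketch using the AWS SDK for Go's s3manager (illustrative only; the endpoint, credentials, and object key are placeholders, and the real tool wires these from its flags):

```go
package main

import (
	"log"
	"os/exec"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	// pg_dump in custom format with heavy compression, writing to stdout.
	cmd := exec.Command("pg_dump", "-Fc", "-Z", "9", "--table=flights", "mydb")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}

	sess, err := session.NewSession(&aws.Config{
		Endpoint:         aws.String("https://fsn1.your-objectstorage.com"),
		Region:           aws.String("auto"),
		Credentials:      credentials.NewStaticCredentials("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY", ""),
		S3ForcePathStyle: aws.Bool(true),
	})
	if err != nil {
		log.Fatal(err)
	}

	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	// s3manager reads the pipe and performs a multipart upload as data
	// arrives, so no intermediate file is written to disk.
	uploader := s3manager.NewUploader(sess)
	_, err = uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("my-archive-bucket"),
		Key:    aws.String("dumps/flights/2024/01/flights-20240115-120000.dump"),
		Body:   stdout,
	})
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Wait(); err != nil {
		log.Fatal(err)
	}
}
```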
Dump entire database:
data-archiver dump \
--db-user myuser \
--db-password mypass \
--db-name mydb \
--path-template "dumps/{table}/{YYYY}/{MM}"

Dump specific table with parallel processing:
data-archiver dump \
--db-user myuser \
--db-password mypass \
--db-name mydb \
--table flights \
--workers 8 \
--path-template "dumps/{table}/{YYYY}/{MM}"

Dry run dump (validate without uploading):
data-archiver dump \
--db-user myuser \
--db-password mypass \
--db-name mydb \
--dry-run \
--path-template "dumps/{table}/{YYYY}/{MM}"

The restore subcommand reverses the archive process: it downloads files from S3, decompresses them, parses the format (JSONL/CSV/Parquet), and inserts the data into PostgreSQL tables with automatic table/partition creation.
data-archiver restore \
--table flights \
--path-template "archives/{table}/{YYYY}/{MM}" \
--start-date 2024-01-01 \
--end-date 2024-01-31

data-archiver restore \
--table flights \
--path-template "archives/{table}/{YYYY}/{MM}" \
--table-partition-range daily \
--start-date 2024-01-01 \
--end-date 2024-01-31

- --table - Base table name (required)
- --path-template - S3 path template matching archive configuration (required)
- --start-date - Start date filter (YYYY-MM-DD, optional)
- --end-date - End date filter (YYYY-MM-DD, optional)
- --table-partition-range - Partition range: hourly, daily, monthly, quarterly, yearly (optional)
- --output-format - Override format detection: jsonl, csv, parquet (optional, auto-detected from file extensions)
- --compression - Override compression detection: zstd, lz4, gzip, none (optional, auto-detected from file extensions)
- Automatic Format Detection: Detects format and compression from file extensions (.jsonl.zst, .csv.lz4, .parquet.gz, etc.)
- Automatic Table Creation: Creates tables automatically if they don't exist, inferring schema from data
- Partition Support: Automatically creates partitions based on --table-partition-range (see the sketch after this list):
  - hourly: Creates partitions like table_2024010115
  - daily: Creates partitions like table_20240101
  - monthly: Creates partitions like table_202401
  - quarterly: Creates partitions like table_2024Q1
  - yearly: Creates partitions like table_2024
- Conflict Handling: Uses ON CONFLICT DO NOTHING to skip existing rows
- Date Range Filtering: Only restores files matching the specified date range
- Sequential Processing: Processes files one at a time (parallel support may be added later)
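A rough sketch of the extension detection and partition-suffix rules above (illustrative helper functions that mirror the documented examples, not the restore command's actual code):

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// detectFormat infers output format and compression from a file name such
// as flights-2024-01-01.jsonl.zst.
func detectFormat(name string) (format, compression string) {
	compression = "none"
	for ext, c := range map[string]string{".zst": "zstd", ".lz4": "lz4", ".gz": "gzip"} {
		if strings.HasSuffix(name, ext) {
			compression = c
			name = strings.TrimSuffix(name, ext)
			break
		}
	}
	switch {
	case strings.HasSuffix(name, ".jsonl"):
		format = "jsonl"
	case strings.HasSuffix(name, ".csv"):
		format = "csv"
	case strings.HasSuffix(name, ".parquet"):
		format = "parquet"
	}
	return format, compression
}

// partitionName builds the target partition for a file's date according to
// --table-partition-range.
func partitionName(table string, t time.Time, rng string) string {
	switch rng {
	case "hourly":
		return table + "_" + t.Format("2006010215")
	case "daily":
		return table + "_" + t.Format("20060102")
	case "monthly":
		return table + "_" + t.Format("200601")
	case "quarterly":
		q := (int(t.Month())-1)/3 + 1
		return fmt.Sprintf("%s_%dQ%d", table, t.Year(), q)
	case "yearly":
		return table + "_" + t.Format("2006")
	}
	return table
}

func main() {
	f, c := detectFormat("flights-2024-01-01.jsonl.zst")
	fmt.Println(f, c) // jsonl zstd

	d, _ := time.Parse("2006-01-02", "2024-01-01")
	fmt.Println(partitionName("flights", d, "daily"))     // flights_20240101
	fmt.Println(partitionName("flights", d, "quarterly")) // flights_2024Q1
}
```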
Restore all files for a table:
data-archiver restore \
--table flights \
--path-template "archives/{table}/{YYYY}/{MM}"

Restore specific date range with daily partitions:
data-archiver restore \
--table flights \
--path-template "archives/{table}/{YYYY}/{MM}" \
--table-partition-range daily \
--start-date 2024-01-01 \
--end-date 2024-01-31

Restore with format override:
data-archiver restore \
--table flights \
--path-template "archives/{table}/{YYYY}/{MM}" \
--output-format parquet \
--compression zstd

Dry run restore (validate without inserting):
data-archiver restore \
--table flights \
--path-template "archives/{table}/{YYYY}/{MM}" \
--dry-run

The tool provides detailed error messages for common issues:
- Database Connection: Checks connectivity before processing
- Partition Discovery: Reports invalid partition formats
- Data Extraction: Handles large datasets with streaming
- Compression: Reports compression failures and ratios
- S3 Upload: Retries on transient failures
- Configuration: Validates all required parameters
- Increase Workers: Use --workers to process more partitions in parallel
- Network: Ensure good bandwidth to S3 endpoint
- Database: Add indexes on date columns for faster queries
- Memory Management:
  - Tool uses streaming architecture with constant ~150 MB memory footprint
  - Memory usage independent of partition size (no OOM on multi-GB partitions)
  - Tune --chunk-size based on average row size:
    - Small rows (~1 KB): --chunk-size 50000 (~50 MB)
    - Medium rows (~10 KB): --chunk-size 10000 (~100 MB, default)
    - Large rows (~100 KB): --chunk-size 1000 (~100 MB)
    - Very large rows (1+ MB): --chunk-size 100 (~100 MB)
- Compression: Multi-core zstd scales with CPU cores
The project includes a comprehensive test suite covering:
- Cache Operations: Row count and file metadata caching, TTL expiration, legacy migration
- Configuration Validation: Required fields, default values, date formats
- Process Management: PID file operations, task tracking, process status checks
Run tests with:
# Run all tests
go test ./...
# Run with verbose output
go test -v ./...
# Run with coverage
go test -cover ./...
# Run specific tests
go test -run TestPartitionCache ./cmd

Build and run with Docker:
# Build the Docker image
docker build -t data-archiver .
# Run with environment variables
docker run --rm \
-e ARCHIVE_DB_HOST=host.docker.internal \
-e ARCHIVE_DB_USER=myuser \
-e ARCHIVE_DB_PASSWORD=mypass \
-e ARCHIVE_DB_NAME=mydb \
-e ARCHIVE_S3_ENDPOINT=https://s3.example.com \
-e ARCHIVE_S3_BUCKET=my-bucket \
-e ARCHIVE_S3_ACCESS_KEY=key \
-e ARCHIVE_S3_SECRET_KEY=secret \
-e ARCHIVE_TABLE=events \
data-archiver
# Run with config file
docker run --rm \
-v ~/.data-archiver.yaml:/root/.data-archiver.yaml \
data-archiver

- Go 1.21+
- PostgreSQL database for testing
- S3-compatible storage for testing
# Clone the repository
git clone https://github.com/airframesio/data-archiver.git
cd data-archiver
# Install Go dependencies
go mod download
# Install Node.js dependencies (for web asset minification)
npm install
# Minify web assets (CSS, JavaScript, HTML)
npm run minify
# Build the binary (with minified assets embedded)
go build -o data-archiver
# Or use the npm build script which minifies and builds in one command
npm run build
# Run tests
go test ./...
# Build for different platforms
GOOS=linux GOARCH=amd64 go build -o data-archiver-linux-amd64
GOOS=darwin GOARCH=arm64 go build -o data-archiver-darwin-arm64
GOOS=windows GOARCH=amd64 go build -o data-archiver.exe

The cache viewer web UI uses minified assets in production builds to reduce load times:
- Original size: 98,389 bytes (HTML + CSS + JS + design system)
- Minified size: 60,995 bytes (38% reduction)
The minification process:
- Uses csso-cli for CSS minification
- Uses terser for JavaScript minification with mangling and compression
- Uses html-minifier-terser for HTML minification
- Automatically runs in CI/CD before building binaries
To minify manually:
# Run the minification script
./scripts/minify.sh
# Or use npm
npm run minify

The minified files are automatically embedded into the Go binary during build.
The project uses GitHub Actions for continuous integration:
- Test Matrix: Tests on Go 1.21.x and 1.22.x
- Platforms: Linux, macOS, Windows
- Coverage: Runs tests with coverage reporting
- Linting: Ensures code quality with golangci-lint
- Binary Builds: Creates binaries for multiple platforms
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
MIT License - see LICENSE file for details
Built with these awesome libraries:
- Charmbracelet - Beautiful CLI components
- Cobra - CLI framework
- Viper - Configuration management
- klauspost/compress - Fast zstd compression
- AWS SDK for Go - S3 integration
- Gorilla WebSocket - WebSocket implementation
- fsnotify - Cross-platform file system notifications

