This is an experimental binding using the s3s crate backed by SQLite.
HIGHLY EXPERIMENTAL: do not use in production, as only basic tests have been run and freedom from data loss cannot be guaranteed.
This crate was built to test the feasibility of using SQLite as an alternative to formats like TFRecord for storing large amounts of machine learning training data behind an accessible API. Although TFRecord and SQLite share a similar few-large-storage-files architecture - making backups and data movement efficient and easy - exposing both a SQLite interface and an Amazon S3/HTTP API makes the contained data easier to access with tooling that likely already exists on your machine.
This concept is backed by benchmarks published by SQLite showing that it can be faster than the filesystem for certain data access patterns.
Each bucket is saved to a separate .sqlite3 database named after the bucket. The smithy-generated bindings for S3 are then mapped to the corresponding SQL calls against a very simple schema that is designed to be human accessible.
Content-MD5 verification (when provided) and SQLite database transactions are used to prevent data loss or partial updates.
The main table, data, is a simple key/value table with a separate metadata store. metadata is split from data because this was found to be more performant for large list_objects calls, presumably because the smaller table can be cached by the SQLite engine. This idea was inspired by BadgerDB, which implemented ideas from the WiscKey paper and saw big wins from separating values from keys.
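As a rough sketch of how this plays out against the schema below (illustrative only - the parameter names and exact statements are assumptions, not the SQL the service actually emits), a put writes both tables inside a single transaction, while a listing can be served from the small metadata table without reading any BLOBs:

-- Sketch: storing an object writes both tables atomically.
BEGIN;
INSERT OR REPLACE INTO data (key, value) VALUES (:key, :blob);
INSERT OR REPLACE INTO metadata (key, size, metadata, last_modified, md5)
    VALUES (:key, :size, :user_metadata, :now, :md5);
COMMIT;

-- Sketch: a list_objects-style call only touches metadata.
SELECT key, size, last_modified
FROM metadata
WHERE key > :start_after AND key GLOB :prefix || '*'
ORDER BY key
LIMIT :max_keys;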
CREATE TABLE IF NOT EXISTS data (
key TEXT PRIMARY KEY,
value BLOB
);
CREATE TABLE IF NOT EXISTS metadata (
key TEXT PRIMARY KEY,
size INTEGER NOT NULL,
metadata TEXT,
last_modified TEXT NOT NULL,
md5 TEXT,
FOREIGN KEY (key) REFERENCES data (key) ON DELETE CASCADE
) WITHOUT ROWID;

For multipart uploads two temporary tables are used:
CREATE TABLE IF NOT EXISTS multipart_upload (
upload_id BLOB NOT NULL PRIMARY KEY,
bucket TEXT NOT NULL,
key TEXT NOT NULL,
last_modified TEXT NOT NULL,
access_key TEXT,
UNIQUE(upload_id, bucket, key)
);
CREATE TABLE IF NOT EXISTS multipart_upload_part (
upload_id BLOB NOT NULL,
last_modified TEXT NOT NULL,
part_number INTEGER NOT NULL,
value BLOB NOT NULL,
size INTEGER NOT NULL,
md5 TEXT,
PRIMARY KEY (upload_id, part_number),
FOREIGN KEY (upload_id) REFERENCES multipart_upload (upload_id) ON DELETE CASCADE
);
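To sketch how these tables are used (again illustrative only - the parameter names and exact statements below are assumptions, and the concatenation of part BLOBs happens in application code rather than in SQL), completing a multipart upload reads the parts in order, writes the assembled object like a normal put, and lets ON DELETE CASCADE clean up the parts:

BEGIN;
-- Read the parts in order; the service concatenates the part BLOBs in memory.
SELECT part_number, value, size, md5
FROM multipart_upload_part
WHERE upload_id = :upload_id
ORDER BY part_number;
-- Write the assembled object and its metadata, as for a normal put.
INSERT OR REPLACE INTO data (key, value) VALUES (:key, :assembled_blob);
INSERT OR REPLACE INTO metadata (key, size, last_modified, md5)
    VALUES (:key, :total_size, :now, :etag);
-- Deleting the upload row cascades to its rows in multipart_upload_part.
DELETE FROM multipart_upload WHERE upload_id = :upload_id;
COMMIT;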
s3ite provides configuration options at the service level (i.e. global options that apply to all buckets or control the API behavior) and at the bucket level for changing the behavior of a specific bucket. s3ite has two methods of configuration: a YAML configuration file and the command-line interface.

The YAML configuration (provided via the --config argument) differs from the command-line interface options in that it allows parameters to be specified at both the service and bucket levels, whereas the command-line interface provides only service-level configuration.
The options are:
root: The base path where the .sqlite3 files will be created.
host: The IP address to listen on for this service.
port: The port to listen on for this service.
access_key: The access key ID that is used to authenticate for this service.
secret_key: The secret access key that is used to authenticate for this service.
concurrency_limit: Enforces a limit on the number of concurrent requests the underlying service can handle. This can be tuned depending on infrastructure, as SSDs and HDDs handle resource contention very differently.
permissive_cors: Allow permissive Cross-Origin Resource Sharing (CORS) requests. This can be enabled to allow users to access this service from a web service running on a different host.
domain_name: The domain to use to allow parsing virtual-hosted-style requests.
read_only: Prevent mutations to any of the databases connected to this service.
journal_mode: Controls the default SQLite journal_mode pragma.
synchronous: Controls the default SQLite synchronous pragma.
temp_store: Controls the default SQLite temp_store pragma.
cache_size: Controls the default SQLite cache_size pragma.
This structure is hierarchical where:
command-line interface arguments will take precedence over any service-level configuration, as below where the default root value . is overridden with /data:
s3ite --root /data

If set, bucket-level configurations will take precedence over the service-level configurations.
This design allows setting specific bucket-level configurations, as below where the mybucket bucket will be set to read-only on startup and all other buckets will remain writable. An error will be raised if any specified bucket does not exist (in this case, an error is raised if mybucket.sqlite3 is not found or is not accessible).
root: /data
host: 0.0.0.0
port: 8014
access_key: AKIAIOSFODNN7EXAMPLE
secret_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
permissive_cors: true
concurrency: 16
read_only: false
journal_mode: WAL
synchronous: NORMAL
temp_store: MEMORY
cache_size: 67108864
buckets:
mybucket:
read_only: true
sqlite:
cache_size: 134217728

To run the service with Docker:

docker run --rm \
-e RUST_LOG=info \
-v $(pwd)/test:/data \
-p 8014:8014 \
s3ite:latest \
./s3ite \
--root /data \
--host 0.0.0.0 \
--port 8014 \
--concurrency-limit 16 \
--access-key AKIAIOSFODNN7EXAMPLE \
--secret-key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

This code can be used as a library or a standalone binary. To build the binary:
cargo build --release

Or install it:

cargo install --path .

Then run:

s3ite --root . --host 0.0.0.0 --port 8014 --access-key AKIAIOSFODNN7EXAMPLE --secret-key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
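With the service running, the same data can be reached through either interface. As a sketch (assuming the example credentials above, a bucket named mybucket, and the standard AWS CLI and sqlite3 tools already installed on your machine):

# Talk to the S3/HTTP API with the AWS CLI, pointing it at the local endpoint.
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
aws --endpoint-url http://localhost:8014 s3 mb s3://mybucket
aws --endpoint-url http://localhost:8014 s3 cp ./example.bin s3://mybucket/example.bin
aws --endpoint-url http://localhost:8014 s3 ls s3://mybucket

# Or inspect the same bucket directly with the SQLite CLI.
sqlite3 mybucket.sqlite3 'SELECT key, size, last_modified FROM metadata;'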