Reading and Writing very large Protocol Buffers encoded files

A platform- and language-neutral file format specification, based on Protocol Buffers, designed to be written and read as a stream.

Problem Statement

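The Protocol Buffers library is not designed for parsing large messages: a message is parsed as a whole, so a very large dataset stored as a single message has to fit in memory.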

Solution

A file format specification for defining custom dataset file formats built from length-prefixed, protobuf-serialized payloads. Files can be read and written in a streaming fashion, so a process never has to hold the entire dataset in physical memory.

File Byte Layout

All integers are 4-byte, signed, big-endian encoded.
The Magic Byte field must be the constant 0x1973.
Header and Payload are opaque, protobuf-serialized byte arrays.

---
title: "File Byte Layout"
---
packet-beta
0-3: "Magic Byte (Int32)"
4-7: "Header Length (Int32)"
8-17: "Header content (variable)"
18-21: "Payload length (Int32)"
22-37: "Payload (variable)"
38-41: "Payload length (Int32)"
42-59: "Payload (variable)"
60-63: "File Seal (Int32)"
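
The layout can be sketched in Python as a writer and a streaming reader. This is a minimal illustration, not the repository's implementation: the function names `write_file` and `read_file` are hypothetical, and the seal value `FILE_SEAL = -1` is an assumption made here (the value of the File Seal is not stated in this section). A negative sentinel is used only because it can never be mistaken for a valid payload length.

```python
import io
import struct
from typing import BinaryIO, Iterable, Iterator, Tuple

MAGIC = 0x1973               # constant required by the layout rules
FILE_SEAL = -1               # ASSUMPTION: hypothetical sentinel, real seal value not stated
INT32 = struct.Struct(">i")  # 4-byte, signed, big-endian


def write_file(out: BinaryIO, header: bytes, payloads: Iterable[bytes]) -> None:
    """Write magic, length-prefixed header, length-prefixed payloads, then the seal."""
    out.write(INT32.pack(MAGIC))
    out.write(INT32.pack(len(header)))
    out.write(header)
    for payload in payloads:  # may be a generator, so payloads are written one at a time
        out.write(INT32.pack(len(payload)))
        out.write(payload)
    out.write(INT32.pack(FILE_SEAL))


def _read_exact(src: BinaryIO, n: int) -> bytes:
    """Read exactly n bytes or fail, so truncated files are detected."""
    data = src.read(n)
    if len(data) != n:
        raise EOFError(f"expected {n} bytes, got {len(data)}")
    return data


def _read_int32(src: BinaryIO) -> int:
    return INT32.unpack(_read_exact(src, 4))[0]


def read_file(src: BinaryIO) -> Tuple[bytes, Iterator[bytes]]:
    """Return the header bytes and a lazy iterator over the payload bytes."""
    if _read_int32(src) != MAGIC:
        raise ValueError("bad magic value")
    header = _read_exact(src, _read_int32(src))

    def payloads() -> Iterator[bytes]:
        while True:
            length = _read_int32(src)
            if length == FILE_SEAL:  # assumed sentinel marks the end of the file
                return
            yield _read_exact(src, length)

    return header, payloads()
```

With generated protobuf bindings, the header and each payload would typically come from a message's `SerializeToString()` on the write side and be passed to `ParseFromString()` on the read side. Because the reader yields one payload at a time, only the current payload needs to be held in memory.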

Indexing Support

This format supports efficient random access to records using a B+Tree index file. The index maps keys (such as entity names) to payload offsets in the data file. This enables fast lookups without scanning the entire file.

Index File: A separate file (e.g., name.index) stores a B+Tree mapping keys to payload offsets.
Random Access: Use the index to retrieve the offset, then seek directly to the payload in the data file.
Multi-language: Both Java and Python implementations are provided for reading and searching the index.
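
The lookup flow (key to offset, then seek) can be sketched as follows. To stay short, the sketch swaps the B+Tree for a sorted key list searched with `bisect`; it is a stand-in that gives the same key-to-offset mapping, not the on-disk index format used by the project. The names `SortedOffsetIndex`, `append_payload`, and `read_payload_at` are hypothetical, and the sketch assumes each stored offset points at the payload's 4-byte length prefix.

```python
import bisect
import io
import struct
from typing import BinaryIO, List, Optional, Tuple

INT32 = struct.Struct(">i")  # 4-byte, signed, big-endian


class SortedOffsetIndex:
    """Stand-in for the B+Tree index: sorted keys searched with binary search."""

    def __init__(self, entries: List[Tuple[str, int]]) -> None:
        ordered = sorted(entries)
        self._keys = [key for key, _ in ordered]
        self._offsets = [offset for _, offset in ordered]

    def find(self, key: str) -> Optional[int]:
        """Return the payload offset for key, or None if the key is absent."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._offsets[i]
        return None


def append_payload(out: BinaryIO, payload: bytes) -> int:
    """Append one length-prefixed payload and return the offset of its length prefix."""
    offset = out.tell()
    out.write(INT32.pack(len(payload)))
    out.write(payload)
    return offset


def read_payload_at(src: BinaryIO, offset: int) -> bytes:
    """Seek straight to a payload (ASSUMPTION: offset points at the length prefix)."""
    src.seek(offset)
    (length,) = INT32.unpack(src.read(4))
    return src.read(length)
```

Recording the offset returned by `append_payload` at write time is enough to build the index entries; a lookup then costs one index search plus one seek, regardless of file size.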

S3 Remote Read Support

The format and libraries support reading both data and index files directly from Amazon S3 using efficient range requests.

S3 Range Reads: Only the required bytes are fetched from S3, minimizing bandwidth and memory usage.
Streaming: Both header and payloads can be read without downloading the full file.
Index and Data: Both the index and data files can be accessed remotely.
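
A range read of a single payload might look like the sketch below. It assumes the `boto3` client and its `get_object` call with a `Range` argument; the helper names `s3_fetcher` and `read_payload`, and the idea of passing a fetch function around, are illustrative choices made here rather than the project's API. The sketch also assumes the offset points at the payload's 4-byte length prefix.

```python
import struct
from typing import Callable

INT32 = struct.Struct(">i")                 # 4-byte, signed, big-endian
RangeFetcher = Callable[[int, int], bytes]  # (start, length) -> bytes


def s3_fetcher(bucket: str, key: str) -> RangeFetcher:
    """Build a fetcher that issues one ranged GET per call (hypothetical helper)."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    client = boto3.client("s3")

    def fetch(start: int, length: int) -> bytes:
        end = start + length - 1  # HTTP byte ranges are inclusive on both ends
        response = client.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
        return response["Body"].read()

    return fetch


def read_payload(fetch: RangeFetcher, offset: int) -> bytes:
    """Fetch the length prefix at offset, then exactly that many payload bytes."""
    (length,) = INT32.unpack(fetch(offset, 4))
    if length == 0:
        return b""  # an empty range cannot be requested, so skip the second call
    return fetch(offset + 4, length)
```

Each payload costs two small requests (length, then body). Taking the fetcher as a parameter keeps the parsing logic independent of S3, so the same function can be exercised against a local byte buffer.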

Supported Languages

Any language that can read and write files as streams and for which protobuf bindings can be generated.

Tested Languages

Java
Python
