A platform- and language-neutral file format specification, based on Protocol Buffers, designed to be written and read as a stream.
The Protocol Buffers library is not designed for parsing large messages, so a large dataset should not be serialized as one single message.
This specification allows custom dataset file formats to be defined on top of length-prefixed, protobuf-serialized payloads. Files can therefore be read and written in a streaming fashion, without holding the entire dataset in the physical memory of the process.
- All integers are 4-byte, signed, big-endian encoded
- The Magic Byte field must be the constant 0x1973 (written as an Int32 like every other integer)
- Header and Payload are opaque, protobuf serialized byte arrays
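
The rules above can be sketched as a small Python writer. This is only an illustration under stated assumptions, not the project's actual implementation: the names `write_file` and `write_int32` are hypothetical, and the value of the trailing File Seal is not given in this document, so `FILE_SEAL = -1` is a placeholder assumption. Header and payloads are passed in as already-serialized bytes (for example the result of `SerializeToString()` on a generated protobuf message).

```python
import io
import struct
from typing import BinaryIO, Iterable

MAGIC = 0x1973   # constant required by the specification
FILE_SEAL = -1   # ASSUMPTION: the real seal value is not stated in this document


def write_int32(out: BinaryIO, value: int) -> None:
    """Write one 4-byte, signed, big-endian integer."""
    out.write(struct.pack(">i", value))


def write_file(out: BinaryIO, header: bytes, payloads: Iterable[bytes],
               seal: int = FILE_SEAL) -> None:
    """Write magic, length-prefixed header, length-prefixed payloads, then the seal."""
    write_int32(out, MAGIC)
    write_int32(out, len(header))
    out.write(header)
    for payload in payloads:
        # Each payload is written as soon as it is available, so the
        # full dataset never has to be held in memory at once.
        write_int32(out, len(payload))
        out.write(payload)
    write_int32(out, seal)
```

Because `payloads` can be any iterable, including a generator, records can be produced and written one at a time.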
---
title: "File Byte Layout"
---
packet-beta
0-3: "Magic Byte (Int32)"
4-7: "Header Length (Int32)"
8-17: "Header content (variable)"
18-21: "Payload length (Int32)"
22-37: "Payload (variable)"
38-41: "Payload length (Int32)"
42-59: "Payload (variable)"
60-63: "File Seal (Int32)"
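
Reading this layout as a stream could look roughly like the following Python sketch. The function names are hypothetical, and because the File Seal value is not specified here, the sketch assumes a placeholder `FILE_SEAL = -1` that appears where the next payload length would otherwise be.

```python
import io
import struct
from typing import BinaryIO, Iterator

MAGIC = 0x1973
FILE_SEAL = -1  # ASSUMPTION: placeholder, the real seal value is not stated here


def _read_exact(stream: BinaryIO, n: int) -> bytes:
    """Read exactly n bytes or fail; guards against truncated files."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"expected {n} bytes, got {len(data)}")
    return data


def _read_int32(stream: BinaryIO) -> int:
    """Read one 4-byte, signed, big-endian integer."""
    return struct.unpack(">i", _read_exact(stream, 4))[0]


def read_header(stream: BinaryIO) -> bytes:
    """Validate the magic value and return the opaque header bytes."""
    magic = _read_int32(stream)
    if magic != MAGIC:
        raise ValueError(f"bad magic value: {magic:#x}")
    return _read_exact(stream, _read_int32(stream))


def iter_payloads(stream: BinaryIO, seal: int = FILE_SEAL) -> Iterator[bytes]:
    """Yield payloads one at a time until the assumed seal value is read."""
    while True:
        length = _read_int32(stream)
        if length == seal:
            return
        yield _read_exact(stream, length)
```

Each yielded payload is still an opaque byte array; a caller would typically hand it to `ParseFromString()` on the matching generated protobuf message.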
This format supports efficient random access to records using a B+Tree index file. The index maps keys (such as entity names) to payload offsets in the data file. This enables fast lookups without scanning the entire file.
- Index File: A separate file (e.g., `name.index`) stores a B+Tree mapping keys to payload offsets.
- Random Access: Use the index to retrieve the offset, then seek directly to the payload in the data file.
- Multi-language: Both Java and Python implementations are provided for reading and searching the index.
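
The lookup-then-seek flow can be illustrated with a short Python sketch. It does not use the provided index libraries: a plain `dict` stands in for the B+Tree index, the function names are hypothetical, and it assumes each stored offset points at the payload's 4-byte length prefix (the document does not say whether offsets point there or at the payload bytes themselves).

```python
import io
import struct
from typing import BinaryIO, Mapping, Optional


def read_payload_at(data: BinaryIO, offset: int) -> bytes:
    """Seek to a length-prefixed payload and return its bytes."""
    data.seek(offset)
    prefix = data.read(4)
    if len(prefix) != 4:
        raise EOFError("offset is past the end of the data file")
    (length,) = struct.unpack(">i", prefix)  # 4-byte, signed, big-endian
    payload = data.read(length)
    if len(payload) != length:
        raise EOFError("payload is truncated")
    return payload


def lookup(index: Mapping[str, int], data: BinaryIO, key: str) -> Optional[bytes]:
    """Resolve a key via the index (a dict stand-in here), then read one payload."""
    offset = index.get(key)
    if offset is None:
        return None
    return read_payload_at(data, offset)
```

Only the requested record is read, so the cost of a lookup does not depend on how many payloads precede it in the file.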
The format and libraries support reading both data and index files directly from Amazon S3 using efficient range requests.
- S3 Range Reads: Only the required bytes are fetched from S3, minimizing bandwidth and memory usage.
- Streaming: Both header and payloads can be read without downloading the full file.
- Index and Data: Both the index and data files can be accessed remotely.
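
As a rough illustration of range-based access (not the API of the provided libraries), the Python sketch below reads the header and a single payload through a small "fetch a byte range" callable. `s3_fetcher` builds such a callable on top of boto3's `get_object` with an HTTP `Range` header; all function names are hypothetical, and the payload offset is assumed to point at the 4-byte length prefix.

```python
import struct
from typing import Callable

# (first_byte, last_byte_inclusive) -> bytes, mirroring HTTP Range semantics
RangeFetcher = Callable[[int, int], bytes]


def s3_fetcher(bucket: str, key: str) -> RangeFetcher:
    """Build a fetcher that issues S3 range requests (requires boto3)."""
    import boto3  # third-party; imported lazily so the rest works without it

    client = boto3.client("s3")

    def fetch(start: int, end: int) -> bytes:
        resp = client.get_object(Bucket=bucket, Key=key,
                                 Range=f"bytes={start}-{end}")
        return resp["Body"].read()

    return fetch


def read_remote_header(fetch: RangeFetcher) -> bytes:
    """Fetch magic + header length (8 bytes), then only the header bytes."""
    magic, header_len = struct.unpack(">ii", fetch(0, 7))
    if magic != 0x1973:
        raise ValueError(f"bad magic value: {magic:#x}")
    if header_len == 0:
        return b""
    return fetch(8, 8 + header_len - 1)


def read_remote_payload(fetch: RangeFetcher, offset: int) -> bytes:
    """Fetch one length-prefixed payload starting at the given offset."""
    (length,) = struct.unpack(">i", fetch(offset, offset + 3))
    if length == 0:
        return b""
    return fetch(offset + 4, offset + 4 + length - 1)
```

Each call transfers only the bytes it needs: two small requests for the header, and two per payload once its offset is known from the index.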
Any language that supports reading and writing files as a stream, and for which protobuf bindings can be generated, can be supported.
- Java
- Python