Skip to content

Implement query-like feature to extract chunk data by namespace and entry key #104

@joaosreis

Description

@joaosreis

We need to implement a query-like feature that can efficiently retrieve chunk data given a namespace and an entry key.

-- Example API (to be refined during implementation)
queryEntry ::
  (HasKey a, MemPack a, Typeable a)
  => FilePath
  -> Namespace
  -> Key a
  -> IO (Maybe a)

queryEntries ::
  (HasKey a, MemPack a, Typeable a)
  => FilePath
  -> Namespace
  -> [Key a]
  -> IO (Map (Key a) a)

queryRange ::
  (HasKey a, MemPack a, Typeable a)
  => FilePath
  -> Namespace
  -> Key a  -- start (inclusive)
  -> Key a  -- end (exclusive)
  -> S.Stream (S.Of a) IO ()

Implementation Considerations

  1. Index Design
  • Option A: Mapping keys to chunk offsets (using INDEX records)
    • Pros: Fast lookups without scanning chunks
    • Cons: Additional INDEX record maintenance, consistency concerns
  • Option B: Leverage sorted order + chunk metadata
    • Pros: No additional INDEX record needed, leverages existing sort
    • Cons: Requires scanning chunk headers, may need key range metadata
  • Option C: In-memory index built on first access
    • Pros: Fast subsequent queries
    • Cons: Initial scan cost, memory usage
  1. Optimizations
  • Add key range metadata to INDEX records (min/max key per chunk)
  • Enable binary search over chunks for key ranges
  • Skip chunks that don't contain the target key range
  • Bloom filters for existence checks (also mentioned in the CIP)
  1. Entry-Level Access
  • Currently chunks are decoded as ByteStrings with entries extracted via unpackLeftOver
  • Searching for anything other than the entry key is currently unattainable?
  1. Streaming vs. Random Access
  • Should query results be streamed or materialized?
  • How to handle large result sets efficiently?
  • Trade-offs between memory usage and I/O

Technical Requirements

  1. Performance Goals
  • O(log n) chunk lookup for single key queries
  • Minimize unnecessary chunk decoding
  • Efficient range queries for sorted data
  1. API Design
  • Consistent with existing Reader module patterns
  • Support both single-key and batch queries
  • Provide streaming interface for range queries

Implementation Steps

  1. Research & Design Phase
  • Design query API
  1. Prototype Phase
  • Implement basic key-based lookup (even if inefficient)
  • Test with sample data
  1. Optimization Phase
  • Implement INDEX record creation and usage for searching
  1. Integration Phase
  • Add comprehensive tests
  • Document query performance characteristics
  • Update examples and user documentation

Acceptance Criteria

[ ] Query API defined and documented
[ ] Efficient key-based lookup implemented
[ ] Tests covering single key, batch, and range queries
[ ] Documentation with usage examples

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions