
Conversation

@nikhilsinhaparseable (Contributor) commented on Nov 11, 2025

Instead of using file_size from the manifest, which is the size of the uncompressed JSON, we should use ingestion_size, which is the compressed size.

Summary by CodeRabbit

  • Bug Fixes
    • Updated file size metric source in table operations for more accurate byte count reporting.

coderabbitai bot (Contributor) commented on Nov 11, 2025

Walkthrough

The StandardTableProvider::partitioned_files method now uses ingestion_size instead of file_size for byte counting. The field destructuring and total file size aggregation have been updated to reference the new field, with no other logic changes.

Changes

Field source update (src/query/stream_schema_provider.rs): Changed the File destructuring and the total_file_size aggregation in StandardTableProvider::partitioned_files to use the ingestion_size field instead of file_size.
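
For context, a minimal sketch of the change described above: the field names file_size and ingestion_size come from this PR, while the struct layout, loop, and function names below are illustrative assumptions rather than Parseable's actual code.

```rust
// Hypothetical manifest entry; only the two size fields are taken
// from the PR discussion, the rest is assumed for illustration.
struct File {
    file_path: String,
    file_size: u64,      // size of the uncompressed JSON payload
    ingestion_size: u64, // compressed (parquet) size actually stored
}

fn total_scanned_bytes(files: &[File]) -> u64 {
    let mut total_file_size: u64 = 0;
    for File { ingestion_size, .. } in files {
        // Before this PR the sum used `file_size`, over-reporting
        // bytes scanned; `ingestion_size` is the compressed size.
        total_file_size += ingestion_size;
    }
    total_file_size
}

fn main() {
    let files = vec![
        File {
            file_path: "date=2025-11-11/file1.parquet".into(),
            file_size: 10_000,
            ingestion_size: 2_500,
        },
        File {
            file_path: "date=2025-11-11/file2.parquet".into(),
            file_size: 8_000,
            ingestion_size: 1_900,
        },
    ];
    // Reports 4400 (compressed) rather than 18000 (uncompressed JSON).
    println!("bytes scanned: {}", total_scanned_bytes(&files));
}
```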

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Verify that ingestion_size is the correct field for byte counting in this context (ensure semantic correctness)
  • Confirm no other references to file_size in related code paths need updating
  • Check if any documentation or comments reference the old behavior

Poem

A rabbit hops through field and stream,
Swapping sizes in the schema's dream,
From file_size to ingestion's call,
Counting bytes more fairly for all! 🐰📊

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Description check ⚠️ Warning
Explanation: The description explains the change rationale but is missing required template sections: the issue reference, a complete description structure, and the testing/documentation checkboxes.
Resolution: Add an issue reference (Fixes #XXXX), expand the description with the solution rationale, and include the template checklist for testing and documentation.

Docstring Coverage ⚠️ Warning
Explanation: Docstring coverage is 0.00%, which is below the required threshold of 80.00%.
Resolution: Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Title check ✅ Passed
Explanation: The title "fix: bytes scanned in query" directly matches the main change: correcting the byte-counting metric used for query scanning.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR, between commits 8cfced3 and 31f0062.

📒 Files selected for processing (1)
  • src/query/stream_schema_provider.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-08-18T19:10:11.941Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1405
File: src/handlers/http/ingest.rs:163-164
Timestamp: 2025-08-18T19:10:11.941Z
Learning: Field statistics calculation in src/storage/field_stats.rs uses None for the time_partition parameter when calling flatten_and_push_logs(), as field stats generation does not require time partition functionality.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-09-18T09:59:20.177Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/metrics/mod.rs:700-756
Timestamp: 2025-09-18T09:59:20.177Z
Learning: In src/event/mod.rs, the parsed_timestamp used in increment_events_ingested_by_date() is correctly UTC-normalized: for dynamic streams it remains Utc::now(), and for streams with time partition enabled it uses the time partition value. Both cases result in proper UTC date strings for metrics labeling, preventing double-counting issues.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-08-25T01:31:41.786Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/metadata.rs:63-68
Timestamp: 2025-08-25T01:31:41.786Z
Learning: The TOTAL_EVENTS_INGESTED_DATE, TOTAL_EVENTS_INGESTED_SIZE_DATE, and TOTAL_EVENTS_STORAGE_SIZE_DATE metrics in src/metadata.rs and src/storage/object_storage.rs are designed to track total events across all streams, not per-stream. They use labels [origin, parsed_date] to aggregate by format and date, while per-stream metrics use [stream_name, origin, parsed_date] labels.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-08-25T01:32:25.980Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/metrics/mod.rs:163-173
Timestamp: 2025-08-25T01:32:25.980Z
Learning: The TOTAL_EVENTS_INGESTED_DATE, TOTAL_EVENTS_INGESTED_SIZE_DATE, and TOTAL_EVENTS_STORAGE_SIZE_DATE metrics in src/metrics/mod.rs are intentionally designed to track global totals across all streams for a given date, using labels ["format", "date"] rather than per-stream labels. This is the correct design for global aggregation purposes.

Applied to files:

  • src/query/stream_schema_provider.rs
📚 Learning: 2025-09-18T09:52:07.554Z
Learnt from: nikhilsinhaparseable
Repo: parseablehq/parseable PR: 1415
File: src/storage/object_storage.rs:173-177
Timestamp: 2025-09-18T09:52:07.554Z
Learning: In Parseable's upload system (src/storage/object_storage.rs), the update_storage_metrics function can safely use path.metadata().map_err() to fail on local file metadata read failures because parquet validation (validate_uploaded_parquet_file) ensures file integrity before this step, and the system guarantees local staging files remain accessible throughout the upload flow.

Applied to files:

  • src/query/stream_schema_provider.rs
🧬 Code graph analysis (1)
src/query/stream_schema_provider.rs (1)
src/catalog/mod.rs (2)
  • ingestion_size (58-58)
  • ingestion_size (70-72)
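
The two ingestion_size hits in src/catalog/mod.rs suggest a struct field plus an accessor. A speculative reconstruction of that shape, assuming everything beyond the ingestion_size name and the file path (the struct names and method body are guessed for illustration):

```rust
// src/catalog/mod.rs (hypothetical reconstruction, not the real code)
pub struct File {
    pub file_path: String,
    pub file_size: u64,      // size of the manifest's JSON representation
    pub ingestion_size: u64, // compressed size of the ingested data
}

pub struct Manifest {
    pub files: Vec<File>,
}

impl Manifest {
    // Assumed accessor matching the second `ingestion_size` hit (70-72):
    // aggregates compressed sizes across all files in the manifest.
    pub fn ingestion_size(&self) -> u64 {
        self.files.iter().map(|f| f.ingestion_size).sum()
    }
}
```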
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
🔇 Additional comments (1)
src/query/stream_schema_provider.rs (1)

342-342: Approve the metric tracking changes.

The updates to use ingestion_size for billing metrics correctly implement the PR objective of tracking compressed parquet file sizes instead of uncompressed JSON sizes.

Also applies to: 348-348
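
As a rough illustration of the billing-metrics side mentioned above, a sketch using the prometheus crate; the metric name, labels, and helper below are invented for illustration and are not Parseable's actual definitions.

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

// Hypothetical counter; Parseable's real metric name and labels may differ.
static QUERY_BYTES_SCANNED: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "query_bytes_scanned_total",
        "Compressed bytes scanned while answering queries",
        &["stream"]
    )
    .expect("metric can be registered")
});

fn record_scanned_bytes(stream: &str, ingestion_size: u64) {
    // Billing should count the compressed (ingestion) size, not the
    // uncompressed JSON size recorded in the manifest's file_size.
    QUERY_BYTES_SCANNED
        .with_label_values(&[stream])
        .inc_by(ingestion_size);
}

fn main() {
    record_scanned_bytes("demo-stream", 2_500);
}
```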


Comment @coderabbitai help to get the list of available commands and usage tips.

