Conversation

@hazemawadalla
This is a major architectural upgrade to the core benchmark logic. It replaces the original "Spillover" memory management strategy with a new "Waterfall LRU" implementation to more accurately simulate enterprise storage hierarchies.

Key Changes:

  • Waterfall Eviction: Implemented recursive eviction (GPU -> CPU -> NVMe). New data now lands in the fastest available tier and pushes cold data down, instead of the old behavior, where new data skipped directly to NVMe whenever RAM was full.
  • Static Buffer Optimization: Replaced the CPU-bound np.random generation with a pre-allocated static noise buffer. This removes the CPU bottleneck that was masking true storage latency, allowing us to fully saturate high-performance NVMe drives.
  • Concurrency Hardening: Added semaphore-based concurrency limits (max_concurrent_allocs) and atomic memory reservations to prevent OOM crashes under heavy load.
  • Storage Metrics: Added explicit tracking for nvme_tokens_processed to calculate true storage throughput separate from system throughput.
  • Stress Test Validation: Verified that this new architecture correctly exposes storage latency limits (e.g., pushing P95 write latency >1000ms) where the old script artificially throttled the load.
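The waterfall eviction described in the first bullet can be sketched as follows. This is a minimal illustration of the recursive GPU -> CPU -> NVMe demotion, not the benchmark's actual code: the `Tier` class, tier capacities, and unit-size entries are all assumptions made for the example.

```python
from collections import OrderedDict

class Tier:
    """One level of the storage hierarchy (illustrative, not the real class)."""
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> size, oldest (LRU) first

    def used(self):
        return sum(self.entries.values())

def waterfall_insert(tiers, key, size):
    """Insert into the fastest tier, recursively demoting LRU entries.

    tiers[0] is the fastest tier (e.g. GPU), tiers[-1] the slowest (e.g. NVMe).
    """
    top = tiers[0]
    # Evict least-recently-used entries until the new block fits.
    while top.used() + size > top.capacity and top.entries:
        old_key, old_size = top.entries.popitem(last=False)  # pop LRU end
        if len(tiers) > 1:
            # Demote cold data one tier down; that tier may evict further.
            waterfall_insert(tiers[1:], old_key, old_size)
        # else: dropped entirely from the slowest tier
    top.entries[key] = size  # new data lands in the fastest tier

gpu = Tier("GPU", 2)
cpu = Tier("CPU", 2)
nvme = Tier("NVMe", 4)
for k in range(5):
    waterfall_insert([gpu, cpu, nvme], k, 1)
# Newest keys stay on GPU; cold keys cascade to CPU, then NVMe.
```

Contrast with the old spillover behavior: when the top tier was full, new data went straight to NVMe, leaving hot data on the slow tier and cold data on the fast one.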

@hazemawadalla hazemawadalla requested a review from a team December 9, 2025 16:04
@hazemawadalla hazemawadalla requested a review from a team as a code owner December 9, 2025 16:04
@github-actions
github-actions bot commented Dec 9, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@wvaske wvaske left a comment


approved

This patch addresses two bugs that surface when running the benchmark
with --enable-rag:

1. Race condition in process_requests (line 2693)

   Worker threads begin processing requests immediately upon benchmark
   start, while RAG document ingestion runs in a separate daemon thread.
   When a worker hits the 10% RAG query path before any documents have
   been ingested, random.choice() is called on an empty list, raising
   IndexError.

   Fixed by adding a truthiness check on self.rag_manager.documents
   before entering the RAG code path. An empty dict evaluates to False,
   so RAG queries are safely skipped until ingestion populates at least
   one document.
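   A minimal sketch of this guard, assuming a simplified `RagManager` with a `documents` dict (the real class has more to it):

```python
import random

class RagManager:
    """Simplified stand-in: documents is populated by the ingestion thread."""
    def __init__(self):
        self.documents = {}

def maybe_rag_query(rag_manager, rng=random):
    """Skip the RAG path until ingestion has added at least one document.

    An empty dict is falsy, so the guard below avoids calling
    random.choice() on an empty sequence, which raises IndexError.
    """
    if rag_manager.documents:  # the truthiness check from the fix
        doc_id = rng.choice(list(rag_manager.documents))
        return rag_manager.documents[doc_id]
    return None  # ingestion not ready yet; caller falls through

mgr = RagManager()
assert maybe_rag_query(mgr) is None        # safe before ingestion
mgr.documents["d1"] = "chunk"
assert maybe_rag_query(mgr) == "chunk"     # works once populated
```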

2. Division by zero in KVCacheGenerator.generate (line 1097)

   The buffer slicing logic uses modulo to compute a pseudo-random start
   index: seed % (buffer_size - total_elements). When total_elements
   exactly equals buffer_size (an edge case permitted by the <= guard),
   the divisor becomes zero, raising ZeroDivisionError.

   Fixed by computing the divisor separately and defaulting start_idx
   to 0 when the divisor is zero.
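   The shape of that fix, as a self-contained sketch (the function name and signature here are illustrative, not the benchmark's):

```python
def pick_start_idx(seed, buffer_size, total_elements):
    """Pseudo-random slice start into a pre-allocated noise buffer.

    The naive form, seed % (buffer_size - total_elements), divides by
    zero when total_elements == buffer_size (an edge case the <= guard
    permits). Computing the divisor first lets us default start_idx to 0.
    """
    divisor = buffer_size - total_elements
    return seed % divisor if divisor > 0 else 0

assert pick_start_idx(12345, 1024, 1024) == 0            # edge case: no room to slide
assert pick_start_idx(12345, 1024, 1000) == 12345 % 24   # normal case
```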
… 4G of DRAM to reduce Queue contention and unrealistic read amplification
@johnugeorge johnugeorge merged commit 92d5e89 into mlcommons:TF_KVCache Dec 22, 2025
1 check passed
@github-actions github-actions bot locked and limited conversation to collaborators Dec 22, 2025