scx_mitosis: Add Optional LLC Awareness and Work Stealing #3146

tommy-u · 2025-12-12T01:38:44Z

Summary

Mitosis previously wasn't topology aware, dispatching tasks to a single DSQ per cell. This work adds
LLC (Last Level Cache) domain awareness to scx_mitosis. When enabled via --enable-llc-awareness,
tasks are assigned to LLC domains within their cell and preferentially scheduled on CPUs sharing that LLC.
Cross-LLC work stealing allows for some degree of load balancing and work conservation, with
configurable throttling to prevent excessive thread migrations.

Examples

# Original behavior, no LLC awareness
scx_mitosis

# LLC-aware scheduling, no work stealing
scx_mitosis --enable-llc-awareness

# LLC-aware with work stealing (recommended)
scx_mitosis --enable-llc-awareness --enable-work-stealing

# LLC-aware with throttled stealing (skip N steal opportunities per task)
scx_mitosis --enable-llc-awareness --enable-work-stealing --steal-throttle 2

Changes

BPF Side

DSQ typed encoding: Introduced struct dsq_type and encoding helpers to create unique DSQ IDs that embed both cell index and LLC index, enabling per-(cell, LLC) dispatch queues.
LLC-aware task assignment: Tasks are assigned to an LLC within their cell using weighted random selection based on CPU count per LLC. The task's cpumask is narrowed to CPUs in its assigned LLC domain.
Per-LLC vtime tracking: Each cell now maintains llc_vtime_now[MAX_LLCS] for fair scheduling within LLC domains, replacing the single cell-wide vtime.
Work stealing: Idle CPUs can steal tasks from sibling LLC DSQs within the same cell. Stolen tasks are retagged to the thief's LLC on next running() call. Stealing is throttled via --steal-throttle to prevent excessive cross-LLC migrations.
CPU context caching: The LLC index is cached in cpu_ctx->llc during init to avoid map lookups in the hot dispatch path.
Steal statistics: Steal events are tracked as CSTAT_STEAL in per-cell cstats.
Header restructuring: LLC-aware logic is isolated in llc_aware.bpf.h with mitosis.bpf.h providing shared types and utilities.

Rust Side

Topology initialization: Populates cpu_to_llc and llc_to_cpus maps from system topology before BPF init.
CLI options: Added --enable-llc-awareness, --enable-work-stealing, and --steal-throttle flags to control the feature.
Cell LLC recalculation: Calls recalc_cell_llc_counts() after cell cpumask changes to update per-LLC CPU counts.

Unchanged Behavior

When --enable-llc-awareness is not set:

Tasks dispatch to flat per-cell DSQs as before
No LLC domain assignment or cross-LLC stealing occurs
FAKE_FLAT_CELL_LLC (0) is used for DSQ encoding

Convert DSQ identifiers from raw u32 to typed dsq_id_t to prepare for L3 cache awareness. Replace old helper functions (cpu_dsq, cell_dsq, dsq_to_cpu) with new typed versions (get_cpu_dsq_id, get_cell_l3_dsq_id, get_cpu_from_dsq). L3 is hardcoded to DUMMY_L3 (0) for now, so scheduler behavior is unchanged. This validates the DSQ encoding mechanism before adding actual L3 awareness in subsequent commits. Changes: - Add MAX_L3S constant to intf.h - Include dsq.bpf.h for typed DSQ helpers - Convert struct task_ctx.dsq from u32 to dsq_id_t - Update all DSQ operations to use .raw when calling BPF functions - Create DSQs using get_cell_l3_dsq_id(cell, DUMMY_L3) instead of raw cell index

Add infrastructure for L3 cache-aware scheduling without changing scheduler behavior: - Add l3_aware.bpf.h with L3 topology helpers and work stealing logic - Add mitosis.bpf.h with core data structures and cleanup guards - Extend struct cell with per-L3 vtime tracking and spinlock - Add mitosis_topology_utils.rs to populate CPU-to-L3 topology maps - Add validate_flags() for runtime configuration checks - Refactor vtime updates into advance_cell_and_cpu_vtime() - Split intf.h for BPF vs userspace (add intf_rust.h for bindgen) - Add enable_l3_awareness and enable_work_stealing flags (unused) No scheduler behavior changes - select_cpu/enqueue/dispatch unchanged.

- Move task_ctx struct to mitosis.bpf.h with L3-aware fields - Add NR_COUNTERS enum to intf.h for function counters - Move some map/struct definitions from mitosis.bpf.c to l3_aware.bpf.h - Convert cells array explicit BPF map for proper locking - Add apply_pending_l3_retag() stub for cross-L3 steal handling - Update advance_cell_and_cpu_vtime() to accept task context

Implements L3 cache-aware task assignment and work stealing across L3 domains within cells. Tasks are assigned to CPUs within specific L3 cache domains to improve locality, with cross-L3 stealing when local queues are empty. Introduces two flags to control the functionality: - enable_l3_awareness: Enables L3-aware task assignment and per-L3 vtime tracking - enable_work_stealing: Enables cross-L3 work stealing when enabled alongside L3 awareness Key changes: - L3 assignment on task wakeup via pick_l3_for_task() - Per-(cell,L3) vtime tracking replaces global cell vtime when L3 awareness is enabled - Task cpumask restriction to CPUs within assigned L3 domain - Work stealing in dispatch() when local L3 queue is empty - Refactored update_task_cpumask() with dedicated L3-aware path in update_task_l3_assignment() - Cell reconfiguration now recalculates L3 CPU counts after applying new CPU assignments - Added function counters and steal statistics for observability - Per-(cell,L3) DSQs created at initialization instead of single cell-wide DSQ

Remove the deprecated cell->vtime_now field and use l3_vtime_now[FAKE_FLAT_CELL_L3] for non-L3-aware mode. This unifies vtime tracking under a single data structure regardless of whether L3 awareness is enabled. - Rename DUMMY_L3 to FAKE_FLAT_CELL_L3 with explanatory comment - Remove vtime_now from struct cell, re-enable static assertions - Clean up apply_pending_l3_retag(): remove stale vtime save/restore comments and clarify that update_task_cpumask sets the new vtime - Document benign race in work stealing l3_cpu_cnt check

- Make steal throttling start-time configurable - Rename l3->llc / L3->LLC for generality - Drop unused ravg_impl - Misc formatting

per-cell - Add llc field to cpu_ctx to avoid map lookup in dispatch hot path - Initialize cpu_ctx->llc during scheduler init from cpu_to_llc map - Replace CSTAT_STEAL as a cell stat, removing the global steal_stats map

likewhatevs · 2025-12-22T17:40:31Z

scheds/rust/scx_mitosis/src/bpf/llc_aware.bpf.h

+
+// A CPU -> LLC cache ID map
+struct cpu_to_llc_map {
+	__uint(type, BPF_MAP_TYPE_ARRAY);


should this be percpu? i vaguely recall that maybe mattering w/ layered. not so much this one in particular (although it looks like it could be), but iiuc more/less that which can be percpu array is best if it is.

Hmm, for cacheing purposes? I think this should be fine because after init it's read only & thus trivially cacheable. Maybe you have other concerns?

dschatzberg

I didn't quite get through half of this. It's a lot to review all at once. Can you split out the work stealing related bits and have this PR only cover LLC locality? I'll keep working through this but in general the quality of my reviews is inversely proportional to the amount of code in the PR.

dschatzberg · 2025-12-23T15:02:39Z