
Conversation

@tommy-u commented Dec 12, 2025

Summary

scx_mitosis previously wasn't topology-aware: it dispatched all of a cell's tasks to a single DSQ.
This work adds LLC (Last Level Cache) domain awareness to scx_mitosis. When enabled via
--enable-llc-awareness, tasks are assigned to LLC domains within their cell and preferentially
scheduled on CPUs sharing that LLC. Cross-LLC work stealing provides a degree of load balancing
and work conservation, with configurable throttling to prevent excessive thread migrations.

Examples

# Original behavior, no LLC awareness
scx_mitosis

# LLC-aware scheduling, no work stealing
scx_mitosis --enable-llc-awareness

# LLC-aware with work stealing (recommended)
scx_mitosis --enable-llc-awareness --enable-work-stealing

# LLC-aware with throttled stealing (skip N steal opportunities per task)
scx_mitosis --enable-llc-awareness --enable-work-stealing --steal-throttle 2

Changes

BPF Side

  • DSQ typed encoding: Introduced struct dsq_type and encoding helpers to create unique DSQ IDs that embed both cell index and LLC index, enabling per-(cell, LLC) dispatch queues (see the sketch after this list).
  • LLC-aware task assignment: Tasks are assigned to an LLC within their cell using weighted random selection based on CPU count per LLC. The task's cpumask is narrowed to CPUs in its assigned LLC domain.
  • Per-LLC vtime tracking: Each cell now maintains llc_vtime_now[MAX_LLCS] for fair scheduling within LLC domains, replacing the single cell-wide vtime.
  • Work stealing: Idle CPUs can steal tasks from sibling LLC DSQs within the same cell. Stolen tasks are retagged to the thief's LLC on next running() call. Stealing is throttled via --steal-throttle to prevent excessive cross-LLC migrations.
  • CPU context caching: The LLC index is cached in cpu_ctx->llc during init to avoid map lookups in the hot dispatch path.
  • Steal statistics: Steal events are tracked as CSTAT_STEAL in per-cell cstats.
  • Header restructuring: LLC-aware logic is isolated in llc_aware.bpf.h with mitosis.bpf.h providing shared types and utilities.
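
To make the encoding concrete, here is a minimal sketch of packing (cell, LLC) into a DSQ ID with shifts and masks. It is illustrative only: the PR itself introduces a typed bitfield union (dsq_id_t, excerpted in the review threads below), and the constant names here are assumptions.

/* Illustrative (cell, LLC) -> DSQ ID packing; the PR's dsq_id_t uses a
 * bitfield union instead. Assumes MAX_LLCS <= 16, so 4 bits suffice. */
#define LLC_SHIFT 4
#define LLC_MASK  ((1u << LLC_SHIFT) - 1)

static inline u64 cell_llc_dsq_id(u32 cell, u32 llc)
{
    return ((u64)cell << LLC_SHIFT) | llc;
}

static inline u32 dsq_id_cell(u64 dsq_id) { return (u32)(dsq_id >> LLC_SHIFT); }
static inline u32 dsq_id_llc(u64 dsq_id)  { return (u32)(dsq_id & LLC_MASK); }

Each cell then creates one DSQ per LLC at init (e.g. scx_bpf_create_dsq(cell_llc_dsq_id(cell, llc), -1)), and enqueue/dispatch address the queue for the task's (cell, LLC) pair.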

Rust Side

  • Topology initialization: Populates cpu_to_llc and llc_to_cpus maps from system topology before BPF init.
  • CLI options: Added --enable-llc-awareness, --enable-work-stealing, and --steal-throttle flags to control the feature.
  • Cell LLC recalculation: Calls recalc_cell_llc_counts() after cell cpumask changes to update per-LLC CPU counts.

Unchanged Behavior

When --enable-llc-awareness is not set:

  • Tasks dispatch to flat per-cell DSQs as before
  • No LLC domain assignment or cross-LLC stealing occurs
  • FAKE_FLAT_CELL_LLC (0) is used for DSQ encoding

Commits

  Convert DSQ identifiers from raw u32 to typed dsq_id_t to prepare for
  L3 cache awareness. Replace old helper functions (cpu_dsq, cell_dsq,
  dsq_to_cpu) with new typed versions (get_cpu_dsq_id, get_cell_l3_dsq_id,
  get_cpu_from_dsq).

  L3 is hardcoded to DUMMY_L3 (0) for now, so scheduler behavior is
  unchanged. This validates the DSQ encoding mechanism before adding
  actual L3 awareness in subsequent commits.

  Changes:
  - Add MAX_L3S constant to intf.h
  - Include dsq.bpf.h for typed DSQ helpers
  - Convert struct task_ctx.dsq from u32 to dsq_id_t
  - Update all DSQ operations to use .raw when calling BPF functions
  - Create DSQs using get_cell_l3_dsq_id(cell, DUMMY_L3) instead of raw cell index
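
For reference, a hedged sketch of what such a typed ID can look like; the actual dsq_id_t in dsq.bpf.h may use different field names, widths, and ordering.

/* Sketch of a typed DSQ ID union; field names and widths are
 * assumptions, not the exact dsq.bpf.h layout. */
typedef union {
    u64 raw;
    struct {
        u64 l3   : 8;  /* L3/LLC index within the cell */
        u64 cell : 8;  /* cell index */
        u64 rsvd : 48;
    };
} dsq_id_t;

The .raw member is what gets handed to BPF calls such as scx_bpf_create_dsq(), which take a plain u64.
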
  Add infrastructure for L3 cache-aware scheduling without changing
  scheduler behavior:

  - Add l3_aware.bpf.h with L3 topology helpers and work stealing logic
  - Add mitosis.bpf.h with core data structures and cleanup guards
  - Extend struct cell with per-L3 vtime tracking and spinlock
  - Add mitosis_topology_utils.rs to populate CPU-to-L3 topology maps
  - Add validate_flags() for runtime configuration checks
  - Refactor vtime updates into advance_cell_and_cpu_vtime()
  - Split intf.h for BPF vs userspace (add intf_rust.h for bindgen)
  - Add enable_l3_awareness and enable_work_stealing flags (unused)

  No scheduler behavior changes - select_cpu/enqueue/dispatch unchanged.

  - Move task_ctx struct to mitosis.bpf.h with L3-aware fields
  - Add NR_COUNTERS enum to intf.h for function counters
  - Move some map/struct definitions from mitosis.bpf.c to l3_aware.bpf.h
  - Convert cells array explicit BPF map for proper locking
  - Add apply_pending_l3_retag() stub for cross-L3 steal handling
  - Update advance_cell_and_cpu_vtime() to accept task context

  Implements L3 cache-aware task assignment and work stealing across L3 domains within cells. Tasks are
  assigned to CPUs within specific L3 cache domains to improve locality, with cross-L3 stealing when
  local queues are empty.

  Introduces two flags to control the functionality:
  - enable_l3_awareness: Enables L3-aware task assignment and per-L3 vtime tracking
  - enable_work_stealing: Enables cross-L3 work stealing when enabled alongside L3 awareness

  Key changes:
  - L3 assignment on task wakeup via pick_l3_for_task() (see the sketch after this list)
  - Per-(cell,L3) vtime tracking replaces global cell vtime when L3 awareness is enabled
  - Task cpumask restriction to CPUs within assigned L3 domain
  - Work stealing in dispatch() when local L3 queue is empty
  - Refactored update_task_cpumask() with dedicated L3-aware path in update_task_l3_assignment()
  - Cell reconfiguration now recalculates L3 CPU counts after applying new CPU assignments
  - Added function counters and steal statistics for observability
  - Per-(cell,L3) DSQs created at initialization instead of single cell-wide DSQ
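
A minimal sketch of the weighted-random selection described above, assuming the per-L3 CPU counts live in the cell struct; the PR's pick_l3_for_task() may differ in detail.

/* Weighted-random L3 pick: each L3's weight is the number of the
 * cell's CPUs in that L3, so larger domains receive proportionally
 * more tasks. Sketch only; not the PR's exact implementation. */
static inline s32 pick_l3_for_cell(const struct cell *cell)
{
    u32 total = 0, target, i;

    for (i = 0; i < MAX_L3S; i++)
        total += cell->l3_cpu_cnt[i];
    if (!total)
        return -1;

    target = bpf_get_prandom_u32() % total;

    for (i = 0; i < MAX_L3S; i++) {
        if (target < cell->l3_cpu_cnt[i])
            return i;
        target -= cell->l3_cpu_cnt[i];
    }
    return -1; /* unreachable if counts are consistent */
}
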
  Remove the deprecated cell->vtime_now field and use
  l3_vtime_now[FAKE_FLAT_CELL_L3] for non-L3-aware mode. This unifies
  vtime tracking under a single data structure regardless of whether
  L3 awareness is enabled.

  - Rename DUMMY_L3 to FAKE_FLAT_CELL_L3 with explanatory comment
  - Remove vtime_now from struct cell, re-enable static assertions
  - Clean up apply_pending_l3_retag(): remove stale vtime save/restore
    comments and clarify that update_task_cpumask sets the new vtime
  - Document benign race in work stealing l3_cpu_cnt check
  - Make steal throttling start-time configurable
  - Rename l3->llc / L3->LLC for generality
  - Drop unused ravg_impl
  - Misc formatting
@tommy-u force-pushed the scx_mitosis_llc_aware branch 2 times, most recently from fc203be to a337e86 on December 12, 2025 02:01
  - Add llc field to cpu_ctx to avoid map lookup in dispatch hot path
  - Initialize cpu_ctx->llc during scheduler init from cpu_to_llc map
  - Make CSTAT_STEAL a per-cell stat, removing the global steal_stats map

// A CPU -> LLC cache ID map
struct cpu_to_llc_map {
    __uint(type, BPF_MAP_TYPE_ARRAY);
@likewhatevs commented Dec 22, 2025:

should this be percpu? I vaguely recall that maybe mattering w/ layered. Not so much this one in particular (although it looks like it could be), but IIUC, more or less, whatever can be a percpu array is best made one.

@tommy-u (author) replied:
Hmm, for caching purposes? I think this should be fine because after init it's read-only and thus trivially cacheable. Maybe you have other concerns?

@dschatzberg left a comment:

I didn't quite get through half of this. It's a lot to review all at once. Can you split out the work stealing related bits and have this PR only cover LLC locality? I'll keep working through this but in general the quality of my reviews is inversely proportional to the amount of code in the PR.


#if __BYTE_ORDER__ != __ORDER_LITTLE_ENDIAN__
#error "dsq64 bitfield layout assumes little-endian (bpfel)."
#endif

It's probably worth making this work for big endian machines relatively soon

    u64 rsvd : 30;
    u64 local_on : 1;
    u64 builtin : 1;
} builtin_dsq;
In the past I've seen gcc/llvm produce really unoptimized code with these bitfields. Stylistically this differs quite a bit from how the Linux kernel typically uses bitfields (instead relying on macros and shifts/masks). I think I'd prefer aligning with the approach used in the kernel but I don't feel too strongly - just take a closer look here.

e.g. take a look at https://github.com/torvalds/linux/blob/master/include/linux/bitfield.h for how the linux kernel helpers look here - might be worth replicating that
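
For readers without that header at hand, the kernel approach keeps a plain u64 and derives fields with masks; a rough sketch (macro names suffixed to avoid claiming the exact kernel definitions, and mask positions illustrative):

/* Mask/shift style in the spirit of include/linux/bitfield.h.
 * (mask & -mask) isolates the mask's lowest set bit, so multiplying
 * or dividing by it shifts by the field's offset. */
#define DSQ_LLC_MASK   0x000000000000000fULL /* bits 0-3: LLC index */
#define DSQ_CELL_MASK  0x0000000000000ff0ULL /* bits 4-11: cell index */

#define FIELD_PREP_(mask, val) ((((u64)(val)) * ((mask) & -(mask))) & (mask))
#define FIELD_GET_(mask, reg)  (((reg) & (mask)) / ((mask) & -(mask)))

static inline u64 mk_dsq_id(u32 cell, u32 llc)
{
    return FIELD_PREP_(DSQ_CELL_MASK, cell) | FIELD_PREP_(DSQ_LLC_MASK, llc);
}

static inline u32 dsq_id_llc_of(u64 dsq_id)
{
    return (u32)FIELD_GET_(DSQ_LLC_MASK, dsq_id);
}

This form also sidesteps the little-endian assumption flagged earlier, since shifts and masks are layout-independent.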

static inline u32 get_cpu_from_dsq(dsq_id_t dsq_id)
{
    if (!is_cpu_dsq(dsq_id))
        scx_bpf_error("trying to get cpu from non-cpu dsq\n");
We should print the DSQ being passed in here, and we also need some return value indicating to callers that they should error out.
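
For example, something like this (a sketch, with a hypothetical INVALID_CPU sentinel; the field access assumes the bitfield layout from the diff):

/* Sketch: report the offending DSQ and return a sentinel the caller
 * must check, since scx_bpf_error() doesn't abort the current BPF
 * program invocation. INVALID_CPU is hypothetical. */
#define INVALID_CPU ((u32)~0U)

static inline u32 get_cpu_from_dsq(dsq_id_t dsq_id)
{
    if (!is_cpu_dsq(dsq_id)) {
        scx_bpf_error("non-cpu dsq 0x%llx\n", dsq_id.raw);
        return INVALID_CPU;
    }
    return (u32)dsq_id.cpu; /* field name assumed */
}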

{
    // Check for valid CPU range, 0 indexed so >=.
    if (cpu >= MAX_CPUS)
        scx_bpf_error("invalid cpu %u\n", cpu);
Similarly here, we need a way to error out of the caller: scx_bpf_error() doesn't just halt the program, it gets checked only after the entire BPF call returns.

static inline dsq_id_t get_cell_l3_dsq_id(u32 cell, u32 l3)
{
    if (cell >= MAX_CELLS || l3 >= MAX_L3S)
        scx_bpf_error("cell %u or l3 %u too large\n", cell, l3);
And here again

    MAX_CG_DEPTH = 256,
    MAX_L3S = 16,

    MITOSIS_ENABLE_STEALING = 1,
Let's avoid having compile-time configuration. I really think we want to enable features at runtime so we can ship a single binary and just dynamically choose the feature set we enable
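
The usual sched_ext pattern for that is a const volatile global set by userspace before load; the verifier treats it as a constant, so disabled branches are pruned and cost nothing at runtime. A sketch (flag and helper names assumed):

/* BPF side: rodata flags, written by the userspace loader before attach. */
const volatile bool enable_llc_awareness = false;
const volatile bool enable_work_stealing = false;

/* e.g. in dispatch(), guard the stealing path at runtime: */
static inline bool maybe_steal(struct cpu_ctx *cctx)
{
    if (!enable_work_stealing)
        return false;
    return try_steal_from_sibling_llc(cctx); /* hypothetical helper */
}

The loader sets these via the skeleton's rodata before load, so one binary covers all configurations.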

struct { \
    u32 __pad; \
} /* 4-byte aligned as required */
#endif
This bit of code probably requires some inline comment explaining why its needed - very confusing to me in isolation

    u64 l3_vtime_now[MAX_L3S];

    // XXX Remove me when shift to llc is complete.
    u64 vtime_now;
The formatting here looks a bit odd

    u32 l3_cpu_cnt[MAX_L3S];

    // per-L3 vtimes within this cell
    u64 l3_vtime_now[MAX_L3S];
I wonder if you would rather have a new struct cell_llc that has the number of CPUs and the vtimes in it - you may even want to cacheline align such structs to reduce cache ping-ponging
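
A sketch of that suggestion (names and the 64-byte alignment are assumptions):

/* Group per-(cell, LLC) state and cacheline-align it so updates to one
 * LLC's vtime don't ping-pong the line holding its neighbors'. */
struct cell_llc {
    u32 cpu_cnt;   /* this cell's CPUs in this LLC */
    u64 vtime_now; /* per-(cell, LLC) vtime */
} __attribute__((aligned(64)));

struct cell {
    /* ... */
    struct cell_llc llcs[MAX_L3S];
};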

/*
* Legacy approach: Update vtime_now before task runs.
* Only used when split vtime updates is enabled.
* Expect "vtime too far ahead" errors if this is enabled.
Can we just remove the flag? Seems like we're confident the code works so no reason to make it conditional

@tommy-u (author) replied:
Totally agree it should be eliminated. I assumed this would need to be done in a separate PR.
