scx_mitosis: Add Optional LLC Awareness and Work Stealing #3146
Conversation
Convert DSQ identifiers from raw u32 to typed dsq_id_t to prepare for L3 cache awareness. Replace old helper functions (cpu_dsq, cell_dsq, dsq_to_cpu) with new typed versions (get_cpu_dsq_id, get_cell_l3_dsq_id, get_cpu_from_dsq). L3 is hardcoded to DUMMY_L3 (0) for now, so scheduler behavior is unchanged. This validates the DSQ encoding mechanism before adding actual L3 awareness in subsequent commits.

Changes:
- Add MAX_L3S constant to intf.h
- Include dsq.bpf.h for typed DSQ helpers
- Convert struct task_ctx.dsq from u32 to dsq_id_t
- Update all DSQ operations to use .raw when calling BPF functions
- Create DSQs using get_cell_l3_dsq_id(cell, DUMMY_L3) instead of raw cell index
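To make the encoding concrete, here is a minimal sketch of what a typed (cell, L3) DSQ id can look like. The union layout and field widths below are assumptions for illustration only; the PR's actual dsq_id_t also has to coexist with the kernel's built-in DSQ bits (see the bitfield discussion further down).

```c
/* Sketch only: field widths and layout are assumptions, not the PR's
 * actual dsq_id_t definition. */
typedef union {
	u64 raw;			/* passed to BPF helpers as-is */
	struct {
		u64 l3   : 16;		/* L3 index within the cell */
		u64 cell : 16;		/* cell index */
		u64 rsvd : 32;
	} cell_l3;
} dsq_id_t;

static inline dsq_id_t get_cell_l3_dsq_id(u32 cell, u32 l3)
{
	/* One DSQ per (cell, L3) pair; DUMMY_L3 (0) keeps behavior flat for now. */
	return (dsq_id_t){ .cell_l3 = { .l3 = l3, .cell = cell } };
}
```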
Add infrastructure for L3 cache-aware scheduling without changing scheduler behavior:
- Add l3_aware.bpf.h with L3 topology helpers and work stealing logic
- Add mitosis.bpf.h with core data structures and cleanup guards
- Extend struct cell with per-L3 vtime tracking and spinlock
- Add mitosis_topology_utils.rs to populate CPU-to-L3 topology maps
- Add validate_flags() for runtime configuration checks
- Refactor vtime updates into advance_cell_and_cpu_vtime()
- Split intf.h for BPF vs userspace (add intf_rust.h for bindgen)
- Add enable_l3_awareness and enable_work_stealing flags (unused)

No scheduler behavior changes - select_cpu/enqueue/dispatch unchanged.
- Move task_ctx struct to mitosis.bpf.h with L3-aware fields
- Add NR_COUNTERS enum to intf.h for function counters
- Move some map/struct definitions from mitosis.bpf.c to l3_aware.bpf.h
- Convert cells array to an explicit BPF map for proper locking
- Add apply_pending_l3_retag() stub for cross-L3 steal handling
- Update advance_cell_and_cpu_vtime() to accept task context
Implements L3 cache-aware task assignment and work stealing across L3 domains within cells. Tasks are assigned to CPUs within specific L3 cache domains to improve locality, with cross-L3 stealing when local queues are empty.

Introduces two flags to control the functionality:
- enable_l3_awareness: Enables L3-aware task assignment and per-L3 vtime tracking
- enable_work_stealing: Enables cross-L3 work stealing when enabled alongside L3 awareness

Key changes:
- L3 assignment on task wakeup via pick_l3_for_task()
- Per-(cell,L3) vtime tracking replaces global cell vtime when L3 awareness is enabled
- Task cpumask restriction to CPUs within assigned L3 domain
- Work stealing in dispatch() when local L3 queue is empty
- Refactored update_task_cpumask() with dedicated L3-aware path in update_task_l3_assignment()
- Cell reconfiguration now recalculates L3 CPU counts after applying new CPU assignments
- Added function counters and steal statistics for observability
- Per-(cell,L3) DSQs created at initialization instead of single cell-wide DSQ
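As a rough sketch of the dispatch/steal flow this commit describes (assuming the per-(cell, L3) DSQ ids above, the enable_work_stealing flag, and some hypothetical glue; this is not the PR's code):

```c
/* Illustrative only: drain the local (cell, L3) queue first, then steal
 * from sibling L3 queues in the same cell when it is empty. */
static void dispatch_sketch(u32 cell_idx, u32 my_l3, struct cell *cell)
{
	if (scx_bpf_dsq_move_to_local(get_cell_l3_dsq_id(cell_idx, my_l3).raw))
		return;

	if (!enable_work_stealing)
		return;

	for (u32 l3 = 0; l3 < MAX_L3S; l3++) {
		/* Skip our own L3 and L3s where this cell owns no CPUs. */
		if (l3 == my_l3 || !cell->l3_cpu_cnt[l3])
			continue;
		if (scx_bpf_dsq_move_to_local(get_cell_l3_dsq_id(cell_idx, l3).raw))
			return;	/* stolen task is retagged to my_l3 afterwards */
	}
}
```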
Remove the deprecated cell->vtime_now field and use
l3_vtime_now[FAKE_FLAT_CELL_L3] for non-L3-aware mode. This unifies
vtime tracking under a single data structure regardless of whether
L3 awareness is enabled.
- Rename DUMMY_L3 to FAKE_FLAT_CELL_L3 with explanatory comment
- Remove vtime_now from struct cell, re-enable static assertions
- Clean up apply_pending_l3_retag(): remove stale vtime save/restore
comments and clarify that update_task_cpumask sets the new vtime
- Document benign race in work stealing l3_cpu_cnt check
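For illustration, the unified lookup after this change can be thought of as follows (the helper name is hypothetical; only l3_vtime_now, FAKE_FLAT_CELL_L3, and the enable flag come from the PR):

```c
/* Non-L3-aware mode simply uses slot FAKE_FLAT_CELL_L3 (0) of the same
 * per-L3 vtime array, so there is only one data structure either way. */
static inline u64 *cell_vtime_slot(struct cell *cell, u32 l3)
{
	u32 idx = enable_l3_awareness ? l3 : FAKE_FLAT_CELL_L3;

	if (idx >= MAX_L3S)		/* bound the index for the verifier */
		idx = FAKE_FLAT_CELL_L3;
	return &cell->l3_vtime_now[idx];
}
```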
- Make steal throttling start-time configurable
- Rename l3->llc / L3->LLC for generality
- Drop unused ravg_impl
- Misc formatting
Force-pushed from fc203be to a337e86.
per-cell
- Add llc field to cpu_ctx to avoid map lookup in dispatch hot path
- Initialize cpu_ctx->llc during scheduler init from cpu_to_llc map
- Replace CSTAT_STEAL as a cell stat, removing the global steal_stats map
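Roughly, the hot-path saving described here looks like the following sketch (the map and field names match the commit text; the surrounding code is illustrative):

```c
/* At init, resolve each CPU's LLC once from the cpu_to_llc map... */
static int init_cpu_llc(s32 cpu, struct cpu_ctx *cctx)
{
	u32 key = cpu;
	u32 *llc = bpf_map_lookup_elem(&cpu_to_llc_map, &key);

	if (!llc)
		return -1;
	cctx->llc = *llc;	/* cached so dispatch() never hits the map */
	return 0;
}
```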
Force-pushed from a337e86 to c7e8d09.
// A CPU -> LLC cache ID map
struct cpu_to_llc_map {
	__uint(type, BPF_MAP_TYPE_ARRAY);
Should this be percpu? I vaguely recall that mattering with layered. Not so much this map in particular (although it looks like it could be), but IIUC, whatever can be a percpu array is best made one.
Hmm, for caching purposes? I think this should be fine because after init it's read-only and thus trivially cacheable. Maybe you have other concerns?
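For reference, the per-CPU variant under discussion would only change the map type; a sketch (key/value types and sizing here are assumptions):

```c
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);	/* vs. BPF_MAP_TYPE_ARRAY */
	__uint(max_entries, MAX_CPUS);
	__type(key, u32);
	__type(value, u32);				/* LLC id for this CPU */
} cpu_to_llc_map SEC(".maps");
```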
dschatzberg left a comment
I didn't quite get through half of this. It's a lot to review all at once. Can you split out the work stealing related bits and have this PR only cover LLC locality? I'll keep working through this but in general the quality of my reviews is inversely proportional to the amount of code in the PR.
#if __BYTE_ORDER__ != __ORDER_LITTLE_ENDIAN__
#error "dsq64 bitfield layout assumes little-endian (bpfel)."
#endif
It's probably worth making this work for big endian machines relatively soon
	u64 rsvd : 30;
	u64 local_on : 1;
	u64 builtin : 1;
} builtin_dsq;
In the past I've seen gcc/llvm produce really unoptimized code with these bitfields. Stylistically this differs quite a bit from how the Linux kernel typically uses bitfields (instead relying on macros and shifts/masks). I think I'd prefer aligning with the approach used in the kernel but I don't feel too strongly - just take a closer look here.
e.g. take a look at https://github.com/torvalds/linux/blob/master/include/linux/bitfield.h for how the linux kernel helpers look here - might be worth replicating that
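For comparison, a mask-and-shift version in the style of include/linux/bitfield.h might look like this (field positions are placeholders, not the PR's layout):

```c
/* GENMASK_ULL / FIELD_PREP / FIELD_GET as in include/linux/bitfield.h. */
#define DSQ_L3_MASK	GENMASK_ULL(15, 0)	/* placeholder positions */
#define DSQ_CELL_MASK	GENMASK_ULL(31, 16)

static inline u64 make_cell_l3_dsq(u32 cell, u32 l3)
{
	return FIELD_PREP(DSQ_CELL_MASK, cell) | FIELD_PREP(DSQ_L3_MASK, l3);
}

static inline u32 dsq_cell(u64 dsq_id)
{
	return FIELD_GET(DSQ_CELL_MASK, dsq_id);
}
```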
static inline u32 get_cpu_from_dsq(dsq_id_t dsq_id)
{
	if (!is_cpu_dsq(dsq_id))
		scx_bpf_error("trying to get cpu from non-cpu dsq\n");
We should print the dsq getting passed in here, and we also need some return value to indicate that callers should error out.
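One possible shape for this, purely as a sketch (the _checked name and the out-parameter style are not from the PR):

```c
static inline int get_cpu_from_dsq_checked(dsq_id_t dsq_id, u32 *cpu_out)
{
	if (!is_cpu_dsq(dsq_id)) {
		/* Include the offending id so the error is actionable... */
		scx_bpf_error("trying to get cpu from non-cpu dsq 0x%llx\n",
			      dsq_id.raw);
		/* ...and give the caller something to bail out on. */
		return -1;
	}
	*cpu_out = get_cpu_from_dsq(dsq_id);
	return 0;
}
```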
{
	// Check for valid CPU range, 0 indexed so >=.
	if (cpu >= MAX_CPUS)
		scx_bpf_error("invalid cpu %u\n", cpu);
Similarly here, we need a way to error out of the caller - scx_bpf_error doesn't just halt the program; it only gets checked after the entire BPF call returns.
static inline dsq_id_t get_cell_l3_dsq_id(u32 cell, u32 l3)
{
	if (cell >= MAX_CELLS || l3 >= MAX_L3S)
		scx_bpf_error("cell %u or l3 %u too large\n", cell, l3);
And here again
	MAX_CG_DEPTH = 256,
	MAX_L3S = 16,
	/* ... */
	MITOSIS_ENABLE_STEALING = 1,
Let's avoid having compile-time configuration. I really think we want to enable features at runtime so we can ship a single binary and just dynamically choose the feature set we enable
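The runtime-configurable pattern being asked for is typically a read-only global that userspace sets through the skeleton before load, e.g. (sketch; the flag name matches the PR, the rest is illustrative):

```c
/* Set from the Rust side via the skeleton's rodata before attach. */
const volatile bool enable_work_stealing = false;

static __always_inline bool stealing_enabled(void)
{
	/* Checked at runtime; no MITOSIS_ENABLE_STEALING compile-time switch. */
	return enable_work_stealing;
}
```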
	struct { \
		u32 __pad; \
	} /* 4-byte aligned as required */
#endif
This bit of code probably requires some inline comment explaining why it's needed - very confusing to me in isolation.
	u64 l3_vtime_now[MAX_L3S];

	// XXX Remove me when shift to llc is complete.
	u64 vtime_now;
The formatting here looks a bit odd
	u32 l3_cpu_cnt[MAX_L3S];

	// per-L3 vtimes within this cell
	u64 l3_vtime_now[MAX_L3S];
I wonder if you would rather have a new struct cell_llc that has the number of CPUs and the vtimes in it - you may even want to cacheline align such structs to reduce cache ping-ponging
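A sketch of the struct being suggested (names and the alignment value are illustrative):

```c
/* Group per-(cell, LLC) state and pad it to a cache line so updates for
 * one LLC don't ping-pong the line holding a neighbour's vtime. */
struct cell_llc {
	u32 cpu_cnt;		/* CPUs this cell owns in the LLC */
	u64 vtime_now;		/* per-(cell, LLC) virtual time */
} __attribute__((aligned(64)));

struct cell {
	/* ... */
	struct cell_llc llc[MAX_L3S];
};
```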
	/*
	 * Legacy approach: Update vtime_now before task runs.
	 * Only used when split vtime updates is enabled.
	 * Expect "vtime too far ahead" errors if this is enabled.
Can we just remove the flag? Seems like we're confident the code works so no reason to make it conditional
Totally agree it should be eliminated. I assumed this would need to be done in a separate PR.
Summary
Mitosis previously wasn't topology aware, dispatching tasks to a single DSQ per cell. This work adds
LLC (Last Level Cache) domain awareness to scx_mitosis. When enabled via
--enable-llc-awareness, tasks are assigned to LLC domains within their cell and preferentially scheduled on CPUs sharing that LLC.
Cross-LLC work stealing allows for some degree of load balancing and work conservation, with
configurable throttling to prevent excessive thread migrations.
Examples
Changes
BPF Side
- New struct dsq_type and encoding helpers to create unique DSQ IDs that embed both cell index and LLC index, enabling per-(cell, LLC) dispatch queues.
- Per-cell llc_vtime_now[MAX_LLCS] for fair scheduling within LLC domains, replacing the single cell-wide vtime.
- Cross-LLC work stealing when the local queue is empty, with the pending LLC retag applied on the next running() call. Stealing is throttled via --steal-throttle to prevent excessive cross-LLC migrations.
- cpu_ctx->llc is initialized during init to avoid map lookups in the hot dispatch path.
- Steal counts tracked as CSTAT_STEAL in per-cell cstats.
- New llc_aware.bpf.h, with mitosis.bpf.h providing shared types and utilities.

Rust Side
- Populates cpu_to_llc and llc_to_cpus maps from system topology before BPF init.
- Adds --enable-llc-awareness, --enable-work-stealing, and --steal-throttle flags to control the feature.
- Calls recalc_cell_llc_counts() after cell cpumask changes to update per-LLC CPU counts.

Unchanged Behavior
When --enable-llc-awareness is not set:
- FAKE_FLAT_CELL_LLC (0) is used for DSQ encoding