SMP/Scheduler Improvement Plan
Status: Active
Date: 2026-05-16
Authority: Canonical execution plan for SMP hardening
Priority order: Bottleneck #1 → #2 → #3 → #4 → #5 → #6 → #7
Context
The Red Bear OS kernel has functional SMP (multi-core) support with x2APIC, per-CPU run queues, work stealing, and DWRR+vruntime scheduling. However, several design choices limit scalability beyond roughly 2-4 cores. This plan addresses the seven identified bottlenecks in priority order.
Reference Sources
- Linux 7.1: local/reference/linux-7.1/ (scheduler, TLB, IRQ affinity)
- seL4: local/reference/seL4/ (lock-free kernel structures, minimal context switch)
- Zircon: online reference only (download failed due to network issues)
Key Kernel Files
| File | Lines | Purpose |
|---|---|---|
| context/switch.rs | 577 | Scheduler, context switch, DWRR |
| context/arch/x86_64.rs | 395 | CONTEXT_SWITCH_LOCK, switch_to, register save/restore |
| percpu.rs | 205 | PercpuBlock, TLB shootdown |
| arch/x86_shared/ipi.rs | 53 | IPI kinds and dispatch |
| arch/x86_shared/device/local_apic.rs | 312 | x2APIC/xAPIC, ICR programming |
| arch/x86_shared/device/ioapic.rs | 476 | IOAPIC, IRQ routing, MapInfo |
| acpi/madt/arch/x86.rs | 354 | AP startup, SIPI, trampoline |
Red Bear Patches (already applied)
| Patch | Lines | Purpose |
|---|---|---|
| P8-percpu-sched | 123 | Per-CPU scheduler queues |
| P8-percpu-wiring | 985 | Work stealing, load balancing, vruntime, affinity |
| P8-work-stealing | 190 | Steal statistics, migration helpers |
| P8-msi | 281 | MSI/MSI-X foundation, vector allocation |
| P8-msi-foundation-v2 | 188 | MSI refinement |
Bottleneck #1: Global CONTEXT_SWITCH_LOCK
Severity: 🔴 Critical — serializes ALL context switches across ALL CPUs
Files: context/arch/x86_64.rs:19, context/switch.rs:123,156,171,296
Current State
// context/arch/x86_64.rs:19
pub static CONTEXT_SWITCH_LOCK: AtomicBool = AtomicBool::new(false);
// context/switch.rs:156-162
while arch::CONTEXT_SWITCH_LOCK
    .compare_exchange_weak(false, true, Ordering::SeqCst, Ordering::Relaxed)
    .is_err()
{
    hint::spin_loop();
    percpu.maybe_handle_tlb_shootdown();
}
Every context switch on any CPU must acquire this single global lock. On an 8-core system, up to 7 CPUs can be spin-waiting while one CPU performs its context switch. This is the primary SMP scalability limiter.
Why It Exists
The comment says: "Acquire the global lock to ensure exclusive access during context switch and avoid issues that would be caused by the unsafe operations below."
The concern is that during switch_to, the CPU is in a transitional state:
1. Prev context's registers are saved
2. Next context's registers are being loaded
3. Stack pointer changes to next context's stack
4. switch_finish_hook runs to drop guards and release lock
During steps 1-4, another CPU switching to the same context could cause data races.
Analysis: The Lock Is Overly Conservative
The per-context write locks (Arc<ContextLock>) already prevent concurrent access to the same context. The switch() function:
- Locks prev context (write), preventing anyone else from modifying it
- Locks next context (write), preventing anyone else from modifying it
- Updates running flags and CPU IDs (under both write locks)
- Stores both guards in switch_result (kept alive until switch_finish_hook)
- Calls arch::switch_to (register swap)
If CPU 0 holds write locks on contexts A and B, CPU 1 cannot lock either A or B. The global lock adds nothing for correctness — it only serializes independent switches involving completely different contexts.
Fix: Per-CPU Flag in PercpuBlock
Replace the global AtomicBool with a per-CPU flag in ContextSwitchPercpu:
// percpu.rs — add to ContextSwitchPercpu
pub in_context_switch: AtomicBool,
The per-CPU flag:
- Each CPU acquires its own flag before switching — zero cross-CPU contention
- Debug assertion catches re-entrant switches on the same CPU
- Released in switch_finish_hook as before
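A minimal sketch of how such a flag could behave, assuming the AtomicBool field shown above. The `begin_switch` and `finish_switch` method names are hypothetical and not taken from the kernel; the struct here contains only the new field.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Reduced stand-in for ContextSwitchPercpu, holding only the new flag.
pub struct ContextSwitchPercpu {
    pub in_context_switch: AtomicBool,
}

impl ContextSwitchPercpu {
    pub const fn new() -> Self {
        Self { in_context_switch: AtomicBool::new(false) }
    }

    /// Marks this CPU as being inside a context switch.
    /// Returns `false` if the flag was already set, i.e. a re-entrant switch.
    /// (Hypothetical helper name.)
    pub fn begin_switch(&self) -> bool {
        !self.in_context_switch.swap(true, Ordering::Acquire)
    }

    /// Clears the flag; intended to be called from the finish hook.
    /// (Hypothetical helper name.)
    pub fn finish_switch(&self) {
        let was_set = self.in_context_switch.swap(false, Ordering::Release);
        debug_assert!(was_set, "finish_switch called without begin_switch");
    }
}
```

Because every CPU owns its own instance, `begin_switch` on one CPU never waits on another CPU; a `false` return can be turned into a panic in debug builds.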
Implementation Steps
- Add in_context_switch: AtomicBool to ContextSwitchPercpu in switch.rs
- Remove CONTEXT_SWITCH_LOCK from context/arch/x86_64.rs (and aarch64, riscv64)
- Replace arch::CONTEXT_SWITCH_LOCK.compare_exchange_weak(...) with per-CPU flag check
- Replace arch::CONTEXT_SWITCH_LOCK.store(false, ...) with per-CPU flag release
- Update switch_finish_hook accordingly
- Rebuild kernel, verify boot
Risk Assessment
- Low risk: The per-CPU flag is structurally equivalent to the global lock for each CPU. The global lock's only effect was preventing concurrent switches on different CPUs, which is unnecessary given per-context write locks.
- Safety net: Keep the per-CPU flag as a debug assertion. If re-entrant switching is detected, panic instead of corrupting state.
Bottleneck #2: No Broadcast TLB Shootdown
Severity: 🔴 Critical — O(N) shootdown on N CPUs, each with individual IPI
Files: percpu.rs:75-113, ipi.rs:22-38
Current State
// percpu.rs:106-112 — shootdown_tlb_ipi when target is None (broadcast)
for id in 0..crate::cpu_count() {
    // TODO: Optimize: use global counter and percpu ack counters, send IPI using
    // destination shorthand "all CPUs".
    shootdown_tlb_ipi(Some(LogicalCpuId::new(id)));
}
Broadcast TLB shootdown is implemented as a loop, sending individual IPIs to each CPU. Each IPI requires:
- Set wants_tlb_shootdown flag on target CPU
- Spin-wait if previous shootdown is still pending
- Send IPI via ipi_single()
- Target CPU processes IPI in interrupt handler
On a 128-core system, this means 127 individual IPI sends, each with spin-wait overhead.
Fix: x2APIC Destination Shorthand
The Local APIC supports destination shorthands in the ICR:
- 01b = "self" (Current)
- 10b = "all including self" (All)
- 11b = "all except self" (Other)
The IpiTarget enum already defines these values (ipi.rs:15-19), and the ipi() function (ipi.rs:22-38) already supports them. We just need to use IpiTarget::Other for broadcast TLB shootdowns.
Implementation Steps
- Add tlb_shootdown_pending: AtomicU32 ACK counter to PercpuBlock
- Add global TLB_SHOOTDOWN_GENERATION: AtomicU32 counter
- In shootdown_tlb_ipi(None):
  - Increment generation counter
  - Set wants_tlb_shootdown on all CPUs (lock-free)
  - Send single IPI with IpiTarget::Other shorthand
- In maybe_handle_tlb_shootdown():
  - Process shootdown
  - Increment ACK counter
- Add wait_for_tlb_acknowledgments() with timeout
- Rebuild kernel, verify boot
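The generation and acknowledgment scheme in these steps could look roughly like the user-space sketch below. It is one possible shape, not the planned kernel code: `ShootdownState` and its methods are hypothetical names, and each CPU records the last generation it acknowledged rather than incrementing a shared counter. The real IPI send and TLB flush are left out.

```rust
use std::hint;
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical bookkeeping for broadcast shootdowns.
pub struct ShootdownState {
    /// Global generation, bumped once per broadcast.
    generation: AtomicU32,
    /// Per-CPU: last generation this CPU has flushed for.
    acked: Vec<AtomicU32>,
}

impl ShootdownState {
    pub fn new(cpu_count: usize) -> Self {
        Self {
            generation: AtomicU32::new(0),
            acked: (0..cpu_count).map(|_| AtomicU32::new(0)).collect(),
        }
    }

    /// Initiator: publish a new generation. The single broadcast IPI
    /// would be sent right after this call.
    pub fn begin_broadcast(&self) -> u32 {
        self.generation.fetch_add(1, Ordering::AcqRel).wrapping_add(1)
    }

    /// Target CPU: called after the local TLB flush has completed.
    pub fn acknowledge(&self, cpu: usize) {
        let current = self.generation.load(Ordering::Acquire);
        self.acked[cpu].store(current, Ordering::Release);
    }

    /// Initiator: bounded wait until every other CPU has acknowledged
    /// `generation`. Returns `false` on timeout. Counter wraparound is
    /// ignored in this sketch.
    pub fn wait_for_acks(&self, initiator: usize, generation: u32, max_spins: u32) -> bool {
        for _ in 0..max_spins {
            let done = self.acked.iter().enumerate().all(|(cpu, acked)| {
                cpu == initiator || acked.load(Ordering::Acquire) >= generation
            });
            if done {
                return true;
            }
            hint::spin_loop();
        }
        false
    }
}
```

The timeout return value lets the caller decide how to react to a CPU that never acknowledges, for example by logging and retrying with targeted IPIs.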
x2APIC ICR Format
For x2APIC, the ICR is a single 64-bit MSR write:
Bits 63:32 = Destination APIC ID (ignored for shorthands)
Bits 19:18 = Destination Shorthand (0=none, 1=self, 2=all, 3=all-except-self)
Bit 15 = Trigger Mode (0=edge, 1=level)
Bit 14 = Level (0=de-assert, 1=assert)
Bit 11 = Destination Mode (0=physical, 1=logical)
Bits 10:8 = Delivery Mode (0=fixed)
Bits 7:0 = Vector
For "all except self" broadcast with TLB vector (0x41):
let icr = (3u64 << 18) | (1 << 14) | (IpiKind::Tlb as u64);
// = 0x0000_0000_000C_4041 (shorthand 3, level assert, vector 0x41)
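As a cross-check of the encoding, here is a small self-contained sketch that builds the ICR value for a shorthand IPI. The `IpiTarget` discriminants follow the shorthand values listed earlier; `icr_with_shorthand` and the constant names are hypothetical helpers, not kernel functions.

```rust
/// Destination shorthand values (ICR bits 19:18).
#[derive(Clone, Copy, Debug)]
#[repr(u8)]
pub enum IpiTarget {
    Current = 1,
    All = 2,
    Other = 3,
}

/// Bit 14: level = assert.
const ICR_LEVEL_ASSERT: u64 = 1 << 14;
/// Shift of the destination shorthand field.
const ICR_SHORTHAND_SHIFT: u32 = 18;

/// Builds a fixed-delivery ICR value for a shorthand IPI.
/// The destination field (bits 63:32) stays zero because it is
/// ignored when a shorthand is used.
pub fn icr_with_shorthand(target: IpiTarget, vector: u8) -> u64 {
    ((target as u64) << ICR_SHORTHAND_SHIFT) | ICR_LEVEL_ASSERT | u64::from(vector)
}
```

For `IpiTarget::Other` and vector 0x41 this evaluates to 0xC4041: 0xC0000 from the shorthand, 0x4000 from the level bit, and 0x41 from the vector, with nothing in the upper 32 bits.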
Bottleneck #3: IRQ Affinity Not Wired to IOAPIC
Severity: 🟡 Medium — stored but never applied to hardware
Files: ioapic.rs, MSI patches P8-msi.patch
Current State
The IOAPIC MapInfo struct has a dest: ApicId field, and the DestinationMode enum has a Logical variant. However:
- No set_affinity() function: there's no way to reprogram an IOAPIC redirection entry to change its destination APIC
- Legacy IRQs all route to BSP: init() hardcodes bsp_apic_id as destination
- MSI patches store affinity: P8-msi.patch adds set_irq_affinity() API but doesn't reprogram IOAPIC hardware
Fix: Add IOAPIC IRQ Affinity
Add a function to reprogram the IOAPIC redirection table entry:
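impl IoApic {
    /// Retargets the redirection entry for `gsi` to `dest` (8-bit destination field only).
    pub fn set_irq_affinity(&self, gsi: u32, dest: ApicId) -> bool {
        // Reject GSIs below this IOAPIC's base instead of underflowing.
        let Some(offset) = gsi.checked_sub(self.gsi_start) else {
            return false;
        };
        let Ok(idx) = u8::try_from(offset) else {
            return false;
        };
        // The destination field in the redirection entry is bits 63:56.
        let Ok(dest_id) = u8::try_from(dest.get()) else {
            return false;
        };
        let mut guard = self.regs.lock();
        let Some(mut entry) = guard.read_ioredtbl(idx) else {
            return false;
        };
        entry &= !(0xFF_u64 << 56);
        entry |= u64::from(dest_id) << 56;
        guard.write_ioredtbl(idx, entry)
    }
}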
Add a public API to find the right IOAPIC and call it:
pub fn set_affinity(irq: u8, dest: ApicId) {
    let gsi = resolve(irq);
    if let Some(apic) = find_ioapic(gsi) {
        apic.set_irq_affinity(gsi, dest);
    }
}
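The bit manipulation at the core of this change can be isolated into a pure function and checked without hardware. The sketch below assumes the 8-bit destination field in bits 63:56 of a redirection entry; `with_destination` and the constant names are hypothetical.

```rust
/// Position of the destination field in a redirection table entry.
const DEST_SHIFT: u32 = 56;
/// Mask covering the 8-bit destination field (bits 63:56).
const DEST_MASK: u64 = 0xFF_u64 << DEST_SHIFT;

/// Returns `entry` with its destination field replaced by `dest_apic_id`.
/// All other bits (vector, delivery mode, mask bit, ...) are preserved.
pub fn with_destination(entry: u64, dest_apic_id: u8) -> u64 {
    (entry & !DEST_MASK) | (u64::from(dest_apic_id) << DEST_SHIFT)
}
```

Keeping this part free of register access makes it easy to unit-test separately from the locking and MMIO code.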
Implementation Steps
- Add IoApic::set_irq_affinity() method
- Add ioapic::set_affinity() public function
- Wire into kernel IRQ scheme set_affinity handler
- Add round-robin or NUMA-aware default affinity for new IRQs
- Rebuild kernel, verify boot
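For the round-robin default mentioned in the steps, one simple policy is to rotate through the online APIC IDs with an atomic cursor. This is a sketch under that assumption; `RoundRobinAffinity` is a hypothetical type, and 8-bit APIC IDs are assumed for brevity.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical round-robin chooser for default IRQ destinations.
pub struct RoundRobinAffinity {
    next: AtomicUsize,
}

impl RoundRobinAffinity {
    pub const fn new() -> Self {
        Self { next: AtomicUsize::new(0) }
    }

    /// Picks the next APIC ID from `online_apic_ids`, wrapping around.
    /// Returns `None` if no CPU is listed.
    pub fn pick(&self, online_apic_ids: &[u8]) -> Option<u8> {
        if online_apic_ids.is_empty() {
            return None;
        }
        let slot = self.next.fetch_add(1, Ordering::Relaxed) % online_apic_ids.len();
        Some(online_apic_ids[slot])
    }
}
```

A NUMA-aware variant would restrict the slice to CPUs on the device's node before rotating.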
Bottleneck #4: Simple Spinlocks for Scheduler Queues
Severity: 🟡 Medium — unfair under contention
Files: context/switch.rs (run_contexts access)
Current State
Per-CPU run queues use spin::Mutex (simple spinlock). Under contention:
- No fairness guarantee — a CPU may spin indefinitely
- No backoff — constant cache line bouncing
- No NUMA awareness — cross-socket contention is expensive
Fix: MCS Lock or Try-Lock with Backoff
Replace spin::Mutex with an MCS lock (John Mellor-Crummey and Michael Scott):
- Each waiter spins on a local flag (cache-line friendly)
- FIFO ordering guarantees fairness
- O(1) cache line transfers on unlock
Alternatively, since per-CPU queues should have low contention:
- Use try_lock() with exponential backoff
- Fall back to global queue if per-CPU queue is contended
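The try-lock alternative can be sketched with a plain test-and-set lock. This is not an MCS lock and not the kernel's spin::Mutex; `BackoffLock`, `lock_with_backoff`, and `MAX_SPINS` are hypothetical names used only to show the backoff pattern and the "give up and fall back" return path.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};

/// Upper bound on the pause between attempts, in spin-loop hints.
const MAX_SPINS: u32 = 1024;

/// Minimal test-and-set lock used to illustrate backoff.
pub struct BackoffLock {
    locked: AtomicBool,
}

impl BackoffLock {
    pub const fn new() -> Self {
        Self { locked: AtomicBool::new(false) }
    }

    /// Single non-blocking acquisition attempt.
    pub fn try_lock(&self) -> bool {
        self.locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }

    /// Tries up to `max_attempts` times, doubling the pause after each
    /// failure. Returns the attempt number that succeeded, or `None` so
    /// the caller can fall back (for example to a global queue).
    pub fn lock_with_backoff(&self, max_attempts: u32) -> Option<u32> {
        let mut spins = 1u32;
        for attempt in 1..=max_attempts {
            if self.try_lock() {
                return Some(attempt);
            }
            for _ in 0..spins {
                hint::spin_loop();
            }
            spins = spins.saturating_mul(2).min(MAX_SPINS);
        }
        None
    }
}
```

Unlike an MCS lock, this gives no FIFO fairness; it only reduces cache-line traffic while waiting and bounds the time spent on a contended queue.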
Implementation Steps
- Implement MCS lock primitive in sync/
- Replace spin::Mutex in run queue access
- Add contention statistics to PercpuBlock
- Rebuild kernel, verify boot
Bottleneck #5: No NUMA Topology Awareness
Severity: 🟡 Medium — treats all CPUs and memory as uniform
Files: acpi/madt/mod.rs, percpu.rs
Current State
- No SRAT parsing (NUMA proximity domains)
- No SLIT parsing (NUMA distance matrix)
- Work stealing is random — may steal from a remote socket
- Memory allocation is uniform — no preference for local node
Fix: ACPI SRAT + SLIT Parsing
- Parse SRAT (System Resource Affinity Table) for CPU-to-node mapping
- Parse SLIT (System Locality Information Table) for distance matrix
- Add numa_node: u32 to PercpuBlock
- Prefer stealing from same-socket CPUs
- Prefer allocating memory from local node
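A node-aware steal policy could be as small as the sketch below, which prefers a victim on the caller's node and only then looks at queue length. `pick_steal_victim` and its slice parameters are hypothetical; the real scheduler would read this information from per-CPU state, and a fuller version could weigh SLIT distances instead of a same-node flag.

```rust
/// Chooses a CPU to steal from.
///
/// `node_of[cpu]` is the NUMA node of each CPU and `queue_len[cpu]` its
/// run-queue length; both slices must have the same length. CPUs with an
/// empty queue are never chosen. Same-node victims win over remote ones;
/// within a group the longest queue wins.
pub fn pick_steal_victim(
    this_cpu: usize,
    node_of: &[u32],
    queue_len: &[usize],
) -> Option<usize> {
    assert_eq!(node_of.len(), queue_len.len());
    let my_node = node_of[this_cpu];
    (0..node_of.len())
        .filter(|&cpu| cpu != this_cpu && queue_len[cpu] > 0)
        // Tuple ordering: `true` (same node) sorts above `false`,
        // then longer queues sort above shorter ones.
        .max_by_key(|&cpu| (node_of[cpu] == my_node, queue_len[cpu]))
}
```

With nodes [0, 0, 1, 1], CPU 0 steals from CPU 1 even if CPU 2 on the remote node has a longer queue, and only crosses nodes when no local CPU has work.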
Implementation Steps
- Add SRAT/SLIT table parsing in acpi/
- Extend PercpuBlock with NUMA info
- Update work stealing to prefer local node
- Update memory allocator with NUMA hints
- Rebuild kernel, verify boot
Bottleneck #6: Coarse TLB Flush
Severity: 🟡 Low-Medium — full TLB flush instead of range-based
Files: percpu.rs:122
Current State
// percpu.rs:122
crate::memory::RmmA::invalidate_all();
Every TLB shootdown flushes the entire TLB, even when only a single page changed. A full flush discards translations that are still valid, and the CPU then pays for page-table walks to refill them, which is expensive on modern CPUs with large TLBs.
Fix: Range-Based and Single-Page Invalidation
Use x86 INVLPG for single-page invalidation:
// For single page:
x86::tlb::flush(page);
// For range:
for page in range.step_by(PAGE_SIZE) {
    x86::tlb::flush(page);
}
// Only use full flush for large ranges (> 32 pages)
Implementation Steps
- Add shootdown_range(start: Page, count: usize) to percpu
- Store shootdown range in PercpuBlock alongside flag
- Replace invalidate_all() with conditional INVLPG
- Fall back to full flush for large ranges (> 32 pages) or PCID flush
- Rebuild kernel, verify boot
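The threshold decision in these steps can be separated from the actual invalidation instructions. In the sketch below, `FlushPlan`, `plan_flush`, and `apply_flush` are hypothetical names, the 4 KiB page size is an assumption, and the real INVLPG and full-flush calls are passed in as closures so the logic can run in user space.

```rust
/// Assumed page size for this sketch.
pub const PAGE_SIZE: usize = 4096;
/// Ranges larger than this many pages fall back to a full flush.
pub const FULL_FLUSH_THRESHOLD: usize = 32;

#[derive(Debug, PartialEq, Eq)]
pub enum FlushPlan {
    /// Invalidate `count` pages starting at virtual address `start`.
    Pages { start: usize, count: usize },
    /// Invalidate the whole TLB.
    Full,
}

/// Chooses between per-page invalidation and a full flush.
pub fn plan_flush(start: usize, count: usize) -> FlushPlan {
    if count > FULL_FLUSH_THRESHOLD {
        FlushPlan::Full
    } else {
        FlushPlan::Pages { start, count }
    }
}

/// Executes a plan. `flush_page` stands in for a single-page
/// invalidation and `flush_all` for the existing full flush.
pub fn apply_flush(
    plan: FlushPlan,
    mut flush_page: impl FnMut(usize),
    flush_all: impl FnOnce(),
) {
    match plan {
        FlushPlan::Full => flush_all(),
        FlushPlan::Pages { start, count } => {
            for i in 0..count {
                flush_page(start + i * PAGE_SIZE);
            }
        }
    }
}
```

The threshold of 32 pages comes from the plan above; it is a tuning parameter rather than a measured break-even point.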
Bottleneck #7: No Priority Inheritance
Severity: 🟡 Low — mutex priority inversion possible
Files: sync/ (various lock primitives)
Current State
There is no priority inheritance protocol. A low-priority thread holding a mutex can be preempted by medium-priority threads, leaving a high-priority thread that waits on the same mutex blocked for an unbounded time (priority inversion).
Fix: Priority Inheritance for Mutexes
Implement the Basic Priority Inheritance Protocol (PI):
- When a thread blocks on a mutex, donate its priority to the mutex holder
- When the mutex is released, restore the original priority
- Support multiple donors (priority queue of donors)
Implementation Steps
- Add donated_priority: Option<usize> to Context
- Implement priority donation in mutex lock acquisition
- Implement priority restoration in mutex unlock
- Add debug assertions to detect inversion
- Rebuild kernel, verify boot
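The donation bookkeeping can be modelled independently of the lock implementation. The sketch below is a simplification under stated assumptions: a larger number means higher priority, `Task` is a hypothetical stand-in for the relevant part of Context, and a Vec of donations replaces both the single donated_priority field and a real priority queue.

```rust
/// Hypothetical stand-in for the priority-related part of a context.
pub struct Task {
    base_priority: usize,
    /// (mutex id, donated priority) for every waiter currently donating.
    donations: Vec<(usize, usize)>,
}

impl Task {
    pub fn new(base_priority: usize) -> Self {
        Self { base_priority, donations: Vec::new() }
    }

    /// Priority the scheduler should use: the base priority or the
    /// highest active donation, whichever is larger.
    pub fn effective_priority(&self) -> usize {
        self.donations
            .iter()
            .map(|&(_, priority)| priority)
            .fold(self.base_priority, usize::max)
    }

    /// Called when a waiter blocks on `mutex_id`, which this task holds.
    pub fn donate(&mut self, mutex_id: usize, waiter_priority: usize) {
        self.donations.push((mutex_id, waiter_priority));
    }

    /// Called when this task releases `mutex_id`: donations tied to that
    /// mutex end, donations through other held mutexes remain.
    pub fn release(&mut self, mutex_id: usize) {
        self.donations.retain(|&(id, _)| id != mutex_id);
    }
}
```

Tracking donations per mutex is what allows the priority to drop to the correct intermediate level when a task holds several mutexes and releases only one of them.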
Execution Timeline
| Phase | Bottleneck | Duration | Dependencies |
|---|---|---|---|
| 1 | #1 CONTEXT_SWITCH_LOCK | 1-2 days | None |
| 2 | #2 Broadcast TLB shootdown | 1-2 days | Phase 1 (per-CPU flags) |
| 3 | #3 IOAPIC IRQ affinity | 1-2 days | None |
| 4 | #4 MCS locks | 2-3 days | Phase 1 (reduced contention) |
| 5 | #6 Range TLB flush | 1 day | Phase 2 (shootdown infrastructure) |
| 6 | #5 NUMA awareness | 3-5 days | Phase 4 (scheduler queues) |
| 7 | #7 Priority inheritance | 2-3 days | None |
Total estimate: 11-18 days
Patch Naming Convention
New kernel patches following this plan:
- P9-percpu-context-switch.patch: Bottleneck #1
- P9-broadcast-tlb-shootdown.patch: Bottleneck #2
- P9-ioapic-irq-affinity.patch: Bottleneck #3
- P9-mcs-locks.patch: Bottleneck #4
- P9-range-tlb-flush.patch: Bottleneck #6
- P9-numa-awareness.patch: Bottleneck #5
- P9-priority-inheritance.patch: Bottleneck #7
All P9 patches must be applied after P8 patches in recipe.toml.