RedBear-OS/local/docs/SMP-SCHEDULER-IMPROVEMENT-PLAN.md

SMP/Scheduler Improvement Plan

Status: Active
Date: 2026-05-16
Authority: Canonical execution plan for SMP hardening
Priority order: Bottleneck #1 → #2 → #3 → #4 → #5 → #6 → #7

Context

The Red Bear OS kernel has functional SMP (multi-core) support with x2APIC, per-CPU run queues, work stealing, and DWRR+vruntime scheduling. However, several design choices limit scalability beyond roughly 2-4 cores. This plan addresses the seven identified bottlenecks in priority order.

Reference Sources

  • Linux 7.1: local/reference/linux-7.1/ — scheduler, TLB, IRQ affinity
  • seL4: local/reference/seL4/ — lock-free kernel structures, minimal context switch
  • Zircon: online reference only (download failed due to network issues)

Key Kernel Files

| File | Lines | Purpose |
| --- | --- | --- |
| context/switch.rs | 577 | Scheduler, context switch, DWRR |
| context/arch/x86_64.rs | 395 | CONTEXT_SWITCH_LOCK, switch_to, register save/restore |
| percpu.rs | 205 | PercpuBlock, TLB shootdown |
| arch/x86_shared/ipi.rs | 53 | IPI kinds and dispatch |
| arch/x86_shared/device/local_apic.rs | 312 | x2APIC/xAPIC, ICR programming |
| arch/x86_shared/device/ioapic.rs | 476 | IOAPIC, IRQ routing, MapInfo |
| acpi/madt/arch/x86.rs | 354 | AP startup, SIPI, trampoline |

Red Bear Patches (already applied)

| Patch | Lines | Purpose |
| --- | --- | --- |
| P8-percpu-sched | 123 | Per-CPU scheduler queues |
| P8-percpu-wiring | 985 | Work stealing, load balancing, vruntime, affinity |
| P8-work-stealing | 190 | Steal statistics, migration helpers |
| P8-msi | 281 | MSI/MSI-X foundation, vector allocation |
| P8-msi-foundation-v2 | 188 | MSI refinement |

Bottleneck #1: Global CONTEXT_SWITCH_LOCK

Severity: 🔴 Critical — serializes ALL context switches across ALL CPUs
Files: context/arch/x86_64.rs:19, context/switch.rs:123,156,171,296

Current State

// context/arch/x86_64.rs:19
pub static CONTEXT_SWITCH_LOCK: AtomicBool = AtomicBool::new(false);

// context/switch.rs:156-162
while arch::CONTEXT_SWITCH_LOCK
    .compare_exchange_weak(false, true, Ordering::SeqCst, Ordering::Relaxed)
    .is_err()
{
    hint::spin_loop();
    percpu.maybe_handle_tlb_shootdown();
}

Every context switch on any CPU must acquire this single global lock. On an 8-core system, up to 7 CPUs can be left spinning while one CPU performs its context switch. This is the primary SMP scalability limiter.

Why It Exists

The comment says: "Acquire the global lock to ensure exclusive access during context switch and avoid issues that would be caused by the unsafe operations below."

The concern is that during switch_to, the CPU is in a transitional state:

  1. Prev context's registers are saved
  2. Next context's registers are being loaded
  3. Stack pointer changes to next context's stack
  4. switch_finish_hook runs to drop guards and release lock

During steps 1-4, another CPU switching to the same context could cause data races.

Analysis: The Lock Is Overly Conservative

The per-context write locks (Arc<ContextLock>) already prevent concurrent access to the same context. The switch() function:

  1. Locks prev context (write) — prevents anyone else from modifying it
  2. Locks next context (write) — prevents anyone else from modifying it
  3. Updates running flags and CPU IDs (under both write locks)
  4. Stores both guards in switch_result (kept alive until switch_finish_hook)
  5. Calls arch::switch_to (register swap)

If CPU 0 holds write locks on contexts A and B, CPU 1 cannot lock either A or B. The global lock adds nothing for correctness — it only serializes independent switches involving completely different contexts.
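
The argument above can be modeled in user space. The following is a minimal sketch that uses std::sync::RwLock as a stand-in for the kernel's ContextLock; the Context fields and the try_claim helper are assumptions made for illustration, not the kernel's actual API. It shows that a second CPU cannot claim a context already held for a switch, while a switch between unrelated contexts proceeds without any global lock:

```rust
use std::sync::{RwLock, RwLockWriteGuard};

/// Hypothetical, simplified context state; the real struct has many more fields.
#[derive(Default)]
pub struct Context {
    pub running: bool,
    pub cpu_id: Option<u32>,
}

/// Stand-in for the kernel's ContextLock (assumed to behave like an RwLock).
pub type ContextLock = RwLock<Context>;

/// Both write guards, kept alive for the duration of the modeled switch.
pub type SwitchGuards<'a> = (RwLockWriteGuard<'a, Context>, RwLockWriteGuard<'a, Context>);

/// Try to claim `prev` and `next` for a switch on `cpu`.
/// Returns None if either context is already locked by another switch.
pub fn try_claim<'a>(
    prev: &'a ContextLock,
    next: &'a ContextLock,
    cpu: u32,
) -> Option<SwitchGuards<'a>> {
    let mut prev_guard = prev.try_write().ok()?;
    let mut next_guard = next.try_write().ok()?;
    // Update running flags and CPU IDs under both write locks.
    prev_guard.running = false;
    prev_guard.cpu_id = None;
    next_guard.running = true;
    next_guard.cpu_id = Some(cpu);
    Some((prev_guard, next_guard))
}
```

In this model the guards play the role of the pair stored in switch_result: as long as they are alive, any other attempt to claim either context fails.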

Fix: Per-CPU Flag in PercpuBlock

Replace the global AtomicBool with a per-CPU flag in ContextSwitchPercpu:

// percpu.rs — add to ContextSwitchPercpu
pub in_context_switch: AtomicBool,

The per-CPU flag:

  • Each CPU acquires its own flag before switching — zero cross-CPU contention
  • Debug assertion catches re-entrant switches on the same CPU
  • Released in switch_finish_hook as before
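
A minimal sketch of how such a flag could behave, written against std atomics so it can be tested outside the kernel. The begin_switch and end_switch names are hypothetical, and the struct here is a reduced stand-in for the real ContextSwitchPercpu:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Reduced stand-in for the per-CPU switch state; only the new flag is modeled.
#[derive(Default)]
pub struct ContextSwitchPercpu {
    pub in_context_switch: AtomicBool,
}

impl ContextSwitchPercpu {
    /// Marks this CPU as switching. Returns false if the flag was already set,
    /// which would indicate a re-entrant switch on the same CPU.
    pub fn begin_switch(&self) -> bool {
        !self.in_context_switch.swap(true, Ordering::Acquire)
    }

    /// Clears the flag; intended for the point where the global lock
    /// used to be released.
    pub fn end_switch(&self) {
        self.in_context_switch.store(false, Ordering::Release);
    }
}
```

In the kernel, the result of begin_switch could feed a debug_assert! rather than a spin loop, since no other CPU ever touches this flag.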

Implementation Steps

  1. Add in_context_switch: AtomicBool to ContextSwitchPercpu in switch.rs
  2. Remove CONTEXT_SWITCH_LOCK from context/arch/x86_64.rs (and aarch64, riscv64)
  3. Replace arch::CONTEXT_SWITCH_LOCK.compare_exchange_weak(...) with per-CPU flag check
  4. Replace arch::CONTEXT_SWITCH_LOCK.store(false, ...) with per-CPU flag release
  5. Update switch_finish_hook accordingly
  6. Rebuild kernel, verify boot

Risk Assessment

  • Low risk: The per-CPU flag is structurally equivalent to the global lock for each CPU. The global lock's only effect was preventing concurrent switches on different CPUs, which is unnecessary given per-context write locks.
  • Safety net: Keep the per-CPU flag as a debug assertion. If re-entrant switching is detected, panic instead of corrupting state.

Bottleneck #2: No Broadcast TLB Shootdown

Severity: 🔴 Critical — O(N) shootdown on N CPUs, each with individual IPI
Files: percpu.rs:75-113, ipi.rs:22-38

Current State

// percpu.rs:106-112 — shootdown_tlb_ipi when target is None (broadcast)
for id in 0..crate::cpu_count() {
    // TODO: Optimize: use global counter and percpu ack counters, send IPI using
    // destination shorthand "all CPUs".
    shootdown_tlb_ipi(Some(LogicalCpuId::new(id)));
}

Broadcast TLB shootdown is implemented as a loop, sending individual IPIs to each CPU. Each IPI requires:

  1. Set wants_tlb_shootdown flag on target CPU
  2. Spin-wait if previous shootdown is still pending
  3. Send IPI via ipi_single()
  4. Target CPU processes IPI in interrupt handler

On a 128-core system, this means 127 individual IPI sends, each with spin-wait overhead.

Fix: x2APIC Destination Shorthand

The Local APIC supports destination shorthands in the ICR:

  • 01b = "self" (Current)
  • 10b = "all including self" (All)
  • 11b = "all except self" (Other)

The IpiTarget enum already defines these values (ipi.rs:15-19), and the ipi() function (ipi.rs:22-38) already supports them. We just need to use IpiTarget::Other for broadcast TLB shootdowns.

Implementation Steps

  1. Add tlb_shootdown_pending: AtomicU32 ACK counter to PercpuBlock
  2. Add global TLB_SHOOTDOWN_GENERATION: AtomicU32 counter
  3. In shootdown_tlb_ipi(None):
    • Increment generation counter
    • Set wants_tlb_shootdown on all CPUs (lock-free)
    • Send single IPI with IpiTarget::Other shorthand
  4. In maybe_handle_tlb_shootdown():
    • Process shootdown
    • Increment ACK counter
  5. Add wait_for_tlb_acknowledgments() with timeout
  6. Rebuild kernel, verify boot
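
The steps above can be sketched as a user-space model. This is one possible shape under stated assumptions, not the planned kernel code: the per-CPU acknowledgment is modeled as an acked_generation field (a hypothetical name), generation wraparound is ignored, and the IPI send and the actual invalidation are left as comments:

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

/// Hypothetical per-CPU state; only the shootdown-related fields are modeled.
pub struct PercpuModel {
    wants_tlb_shootdown: AtomicBool,
    acked_generation: AtomicU32,
}

pub struct ShootdownModel {
    generation: AtomicU32, // plays the role of TLB_SHOOTDOWN_GENERATION
    cpus: Vec<PercpuModel>,
}

impl ShootdownModel {
    pub fn new(cpu_count: usize) -> Self {
        let cpus = (0..cpu_count)
            .map(|_| PercpuModel {
                wants_tlb_shootdown: AtomicBool::new(false),
                acked_generation: AtomicU32::new(0),
            })
            .collect();
        Self { generation: AtomicU32::new(0), cpus }
    }

    /// Initiator side: bump the generation and flag every other CPU.
    /// The single "all except self" IPI would be sent right after this.
    pub fn begin_broadcast(&self, initiator: usize) -> u32 {
        let generation = self.generation.fetch_add(1, Ordering::AcqRel) + 1;
        for (id, cpu) in self.cpus.iter().enumerate() {
            if id != initiator {
                cpu.wants_tlb_shootdown.store(true, Ordering::Release);
            }
        }
        generation
    }

    /// Target side: returns true if a shootdown was pending and is now acknowledged.
    pub fn maybe_handle_tlb_shootdown(&self, cpu: usize) -> bool {
        let me = &self.cpus[cpu];
        if !me.wants_tlb_shootdown.swap(false, Ordering::AcqRel) {
            return false;
        }
        let generation = self.generation.load(Ordering::Acquire);
        // The actual TLB invalidation would happen here.
        me.acked_generation.store(generation, Ordering::Release);
        true
    }

    /// Initiator side: spin until every other CPU has acknowledged `generation`,
    /// giving up after `max_spins` iterations. Returns true on success.
    pub fn wait_for_tlb_acknowledgments(
        &self,
        initiator: usize,
        generation: u32,
        max_spins: u32,
    ) -> bool {
        for _ in 0..max_spins {
            let done = self.cpus.iter().enumerate().all(|(id, cpu)| {
                id == initiator || cpu.acked_generation.load(Ordering::Acquire) >= generation
            });
            if done {
                return true;
            }
            hint::spin_loop();
        }
        false
    }
}
```

Tracking an acknowledged generation per CPU, rather than one shared counter, keeps each acknowledgment on the acknowledging CPU's own cache line and tolerates overlapping broadcasts.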

x2APIC ICR Format

For x2APIC, the ICR is a single 64-bit MSR write:

Bits 63:32 = Destination APIC ID (ignored for shorthands)
Bits 19:18 = Destination Shorthand (0=none, 1=self, 2=all, 3=all-except-self)
Bit  15    = Trigger Mode (0=edge, 1=level)
Bit  14    = Level (0=de-assert, 1=assert)
Bit  11    = Destination Mode (0=physical, 1=logical)
Bits 10:8  = Delivery Mode (0=fixed)
Bits 7:0   = Vector

For "all except self" broadcast with TLB vector (0x41), edge-triggered with the level bit asserted:

let icr = (3u64 << 18) | (1 << 14) | (IpiKind::Tlb as u64);
// = 0x0000_0000_000C_4041
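
As a cross-check of the encoding, here is a small helper with named constants. The Shorthand enum and fixed_ipi_icr function are hypothetical names for this sketch; the bit positions follow the ICR layout in the Intel SDM (destination shorthand in bits 19:18, level assert in bit 14, vector in bits 7:0):

```rust
/// Destination shorthand values for ICR bits 19:18.
#[derive(Clone, Copy)]
#[repr(u64)]
pub enum Shorthand {
    None = 0,
    Current = 1,
    All = 2,
    Other = 3,
}

const ICR_LEVEL_ASSERT: u64 = 1 << 14;
const ICR_SHORTHAND_SHIFT: u32 = 18;

/// Builds an ICR value for a fixed-delivery, edge-triggered IPI.
/// The destination field (bits 63:32) stays zero because it is ignored
/// whenever a shorthand other than `None` is used.
pub const fn fixed_ipi_icr(shorthand: Shorthand, vector: u8) -> u64 {
    ((shorthand as u64) << ICR_SHORTHAND_SHIFT) | ICR_LEVEL_ASSERT | (vector as u64)
}
```

With the "all except self" shorthand and vector 0x41 this yields 0xC4041, with nothing set in the upper 32 bits.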

Bottleneck #3: IRQ Affinity Not Wired to IOAPIC

Severity: 🟡 Medium — stored but never applied to hardware
Files: ioapic.rs, MSI patches P8-msi.patch

Current State

The IOAPIC MapInfo struct has a dest: ApicId field, and the DestinationMode enum has a Logical variant. However:

  1. No set_affinity() function: there is no way to reprogram an IOAPIC redirection entry to change its destination APIC
  2. Legacy IRQs all route to the BSP: init() hardcodes bsp_apic_id as the destination
  3. MSI patches only store affinity: P8-msi.patch adds a set_irq_affinity() API but doesn't reprogram IOAPIC hardware

Fix: Add IOAPIC IRQ Affinity

Add a function to reprogram the IOAPIC redirection table entry:

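impl IoApic {
    /// Retargets `gsi` to the local APIC `dest`. Returns false if the GSI is out of
    /// range for this IOAPIC, `dest` does not fit, or the register access fails.
    pub fn set_irq_affinity(&self, gsi: u32, dest: ApicId) -> bool {
        // Reject GSIs below this IOAPIC's base instead of underflowing.
        let Some(idx) = gsi.checked_sub(self.gsi_start).and_then(|i| u8::try_from(i).ok()) else {
            return false;
        };
        // Without interrupt remapping, the redirection entry only has an 8-bit
        // destination field (bits 63:56), so wider x2APIC IDs cannot be encoded.
        let Ok(dest_id) = u8::try_from(dest.get()) else {
            return false;
        };
        let mut guard = self.regs.lock();
        let Some(mut entry) = guard.read_ioredtbl(idx) else {
            return false;
        };
        entry &= !(0xFF_u64 << 56);
        entry |= u64::from(dest_id) << 56;
        guard.write_ioredtbl(idx, entry)
    }
}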

Add a public API to find the right IOAPIC and call it:

pub fn set_affinity(irq: u8, dest: ApicId) {
    let gsi = resolve(irq);
    if let Some(apic) = find_ioapic(gsi) {
        apic.set_irq_affinity(gsi, dest);
    }
}

Implementation Steps

  1. Add IoApic::set_irq_affinity() method
  2. Add ioapic::set_affinity() public function
  3. Wire into kernel IRQ scheme set_affinity handler
  4. Add round-robin or NUMA-aware default affinity for new IRQs
  5. Rebuild kernel, verify boot
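
For step 4, a round-robin default could look like the following sketch. RoundRobinAffinity and pick are hypothetical names, and APIC IDs are modeled as plain u32 values rather than the kernel's ApicId type:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hands out destination APIC IDs in rotation for newly allocated IRQs.
pub struct RoundRobinAffinity {
    apic_ids: Vec<u32>,
    next: AtomicUsize,
}

impl RoundRobinAffinity {
    pub fn new(apic_ids: Vec<u32>) -> Self {
        Self { apic_ids, next: AtomicUsize::new(0) }
    }

    /// Returns the next APIC ID in rotation, or None if no CPU is registered.
    pub fn pick(&self) -> Option<u32> {
        if self.apic_ids.is_empty() {
            return None;
        }
        // Relaxed is enough: only the counter value matters, not ordering
        // relative to other memory.
        let slot = self.next.fetch_add(1, Ordering::Relaxed) % self.apic_ids.len();
        Some(self.apic_ids[slot])
    }
}
```

A NUMA-aware variant would keep one such rotation per node and pick the node closest to the device.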

Bottleneck #4: Simple Spinlocks for Scheduler Queues

Severity: 🟡 Medium — unfair under contention
Files: context/switch.rs (run_contexts access)

Current State

Per-CPU run queues use spin::Mutex (simple spinlock). Under contention:

  • No fairness guarantee — a CPU may spin indefinitely
  • No backoff — constant cache line bouncing
  • No NUMA awareness — cross-socket contention is expensive

Fix: MCS Lock or Try-Lock with Backoff

Replace spin::Mutex with an MCS lock (John Mellor-Crummey and Michael Scott):

  • Each waiter spins on a local flag (cache-line friendly)
  • FIFO ordering guarantees fairness
  • O(1) cache line transfers on unlock
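
A minimal sketch of the MCS algorithm using std atomics. It exposes a raw lock/unlock pair with caller-provided queue nodes; a kernel version would likely wrap this in a guard type. McsNode and McsLock are illustrative names, not an existing sync/ API:

```rust
use std::hint;
use std::ptr;
use std::sync::atomic::{AtomicBool, AtomicPtr, Ordering};

/// Queue node owned by the waiter; each waiter spins only on its own `locked` flag.
pub struct McsNode {
    locked: AtomicBool,
    next: AtomicPtr<McsNode>,
}

impl McsNode {
    pub const fn new() -> Self {
        Self { locked: AtomicBool::new(false), next: AtomicPtr::new(ptr::null_mut()) }
    }
}

pub struct McsLock {
    tail: AtomicPtr<McsNode>,
}

impl McsLock {
    pub const fn new() -> Self {
        Self { tail: AtomicPtr::new(ptr::null_mut()) }
    }

    /// # Safety
    /// `node` must not be moved, dropped, or reused until the matching
    /// `unlock(node)` has returned.
    pub unsafe fn lock(&self, node: &McsNode) {
        node.next.store(ptr::null_mut(), Ordering::Relaxed);
        node.locked.store(true, Ordering::Relaxed);
        let me = node as *const McsNode as *mut McsNode;
        // Append to the queue; the previous tail (if any) is our predecessor.
        let prev = self.tail.swap(me, Ordering::AcqRel);
        if !prev.is_null() {
            // SAFETY: the predecessor keeps its node alive until it has handed
            // the lock over, which it cannot do before we link ourselves.
            let prev_ref: &McsNode = unsafe { &*prev };
            prev_ref.next.store(me, Ordering::Release);
            while node.locked.load(Ordering::Acquire) {
                hint::spin_loop();
            }
        }
    }

    /// # Safety
    /// Must be called by the current holder with the node it passed to `lock`.
    pub unsafe fn unlock(&self, node: &McsNode) {
        let me = node as *const McsNode as *mut McsNode;
        let mut next = node.next.load(Ordering::Acquire);
        if next.is_null() {
            // No visible successor: try to swing the tail back to empty.
            if self
                .tail
                .compare_exchange(me, ptr::null_mut(), Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
            // A successor swapped the tail but has not linked itself yet.
            while next.is_null() {
                hint::spin_loop();
                next = node.next.load(Ordering::Acquire);
            }
        }
        // SAFETY: the successor is still spinning on its own node.
        let next_ref: &McsNode = unsafe { &*next };
        next_ref.locked.store(false, Ordering::Release);
    }
}
```

The hand-off in unlock writes to exactly one remote cache line (the successor's flag), which is where the fairness and cache behavior listed above come from.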

Alternatively, since per-CPU queues should have low contention:

  • Use try_lock() with exponential backoff
  • Fall back to global queue if per-CPU queue is contended
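
The try-lock alternative can be sketched as follows, with std::sync::Mutex standing in for spin::Mutex. The function name, the attempt limit, and the spin cap are assumptions for illustration; a None result is where the caller would fall back to the global queue:

```rust
use std::hint;
use std::sync::{Mutex, MutexGuard, TryLockError};

/// Upper bound on the busy-wait between attempts (assumed value).
const MAX_BACKOFF_SPINS: u32 = 1 << 10;

/// Tries to lock `queue` up to `max_attempts` times, doubling the busy-wait
/// between attempts. Returns None if the queue stayed contended.
pub fn try_lock_with_backoff<T>(
    queue: &Mutex<T>,
    max_attempts: u32,
) -> Option<MutexGuard<'_, T>> {
    let mut spins = 1u32;
    for _ in 0..max_attempts {
        match queue.try_lock() {
            Ok(guard) => return Some(guard),
            // A poisoned std mutex is still locked by us; take the guard anyway.
            Err(TryLockError::Poisoned(poisoned)) => return Some(poisoned.into_inner()),
            Err(TryLockError::WouldBlock) => {
                for _ in 0..spins {
                    hint::spin_loop();
                }
                spins = (spins * 2).min(MAX_BACKOFF_SPINS);
            }
        }
    }
    None
}
```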

Implementation Steps

  1. Implement MCS lock primitive in sync/
  2. Replace spin::Mutex in run queue access
  3. Add contention statistics to PercpuBlock
  4. Rebuild kernel, verify boot

Bottleneck #5: No NUMA Topology Awareness

Severity: 🟡 Medium — treats all CPUs and memory as uniform
Files: acpi/madt/mod.rs, percpu.rs

Current State

  • No SRAT parsing (NUMA proximity domains)
  • No SLIT parsing (NUMA distance matrix)
  • Work stealing is random — may steal from a remote socket
  • Memory allocation is uniform — no preference for local node

Fix: ACPI SRAT + SLIT Parsing

  1. Parse SRAT (System Resource Affinity Table) for CPU-to-node mapping
  2. Parse SLIT (System Locality Information Table) for distance matrix
  3. Add numa_node: u32 to PercpuBlock
  4. Prefer stealing from same-socket CPUs
  5. Prefer allocating memory from local node
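
Once the CPU-to-node mapping and the distance matrix are available, victim selection for work stealing could be ordered by distance. A minimal sketch, assuming the SRAT data has been reduced to a cpu_node slice and the SLIT to a square distance matrix (both layouts, and the steal_order name, are hypothetical):

```rust
/// Returns the other CPUs ordered from nearest to farthest NUMA node.
/// `cpu_node[c]` is the node of CPU `c`; `distance[a][b]` is the SLIT-style
/// distance from node `a` to node `b` (10 means local).
pub fn steal_order(this_cpu: usize, cpu_node: &[u32], distance: &[Vec<u8>]) -> Vec<usize> {
    let my_node = cpu_node[this_cpu] as usize;
    let mut victims: Vec<usize> = (0..cpu_node.len()).filter(|&c| c != this_cpu).collect();
    // Stable sort: CPUs at equal distance keep their ascending ID order.
    victims.sort_by_key(|&c| distance[my_node][cpu_node[c] as usize]);
    victims
}
```

On a two-socket layout with CPUs 0-1 on node 0 and CPUs 2-3 on node 1, CPU 2 would try CPU 3 first and only then the remote CPUs 0 and 1.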

Implementation Steps

  1. Add SRAT/SLIT table parsing in acpi/
  2. Extend PercpuBlock with NUMA info
  3. Update work stealing to prefer local node
  4. Update memory allocator with NUMA hints
  5. Rebuild kernel, verify boot

Bottleneck #6: Coarse TLB Flush

Severity: 🟡 Low-Medium — full TLB flush instead of range-based
Files: percpu.rs:122

Current State

// percpu.rs:122
crate::memory::RmmA::invalidate_all();

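Every TLB shootdown flushes the entire TLB, even when only a single page changed. A full flush discards translations that are still valid, so the CPU must re-walk the page tables for its working set afterwards, which is costly on modern CPUs with large TLBs.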

Fix: Range-Based and Single-Page Invalidation

Use x86 INVLPG for single-page invalidation:

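// For single page (INVLPG; `page` is the page's virtual address):
unsafe { x86::tlb::flush(page) };

// For a range (assumed here to be a Range<usize> of page-aligned addresses):
if range.len() / PAGE_SIZE <= 32 {
    for page in range.step_by(PAGE_SIZE) {
        unsafe { x86::tlb::flush(page) };
    }
} else {
    // Only use full flush for large ranges (> 32 pages)
    unsafe { crate::memory::RmmA::invalidate_all() };
}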

Implementation Steps

  1. Add shootdown_range(start: Page, count: usize) to percpu
  2. Store shootdown range in PercpuBlock alongside flag
  3. Replace invalidate_all() with conditional INVLPG
  4. Fall back to full flush for large ranges (> 32 pages) or PCID flush
  5. Rebuild kernel, verify boot
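
Storing a range next to the flag raises the question of what happens when a second request arrives before the first one was handled. One option is to merge requests into a covering range and escalate to a full flush past the threshold. A sketch under those assumptions (PendingFlush and merge are hypothetical names; the 4 KiB page size and the 32-page threshold mirror the values used above):

```rust
const PAGE_SIZE: usize = 4096;
const FULL_FLUSH_THRESHOLD: usize = 32; // pages

/// What a CPU still has to invalidate when it next handles a shootdown.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PendingFlush {
    Idle,
    Range { start: usize, pages: usize },
    Full,
}

impl PendingFlush {
    /// Folds a new request (`pages` pages starting at address `start`) into the
    /// pending state. Ranges are merged into their covering range; anything
    /// larger than the threshold becomes a full flush.
    pub fn merge(self, start: usize, pages: usize) -> PendingFlush {
        let (lo, hi) = match self {
            PendingFlush::Full => return PendingFlush::Full,
            PendingFlush::Idle => (start, start + pages * PAGE_SIZE),
            PendingFlush::Range { start: s, pages: p } => {
                (s.min(start), (s + p * PAGE_SIZE).max(start + pages * PAGE_SIZE))
            }
        };
        let merged_pages = (hi - lo) / PAGE_SIZE;
        if merged_pages > FULL_FLUSH_THRESHOLD {
            PendingFlush::Full
        } else {
            PendingFlush::Range { start: lo, pages: merged_pages }
        }
    }
}
```

The covering range may include pages that nobody asked to flush; that only costs a few extra invalidations and never misses a stale entry.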

Bottleneck #7: No Priority Inheritance

Severity: 🟡 Low — mutex priority inversion possible
Files: sync/ (various lock primitives)

Current State

No priority inheritance protocol. If a low-priority thread holding a mutex is preempted by medium-priority threads, a high-priority thread waiting on the same mutex is delayed for an unbounded time (priority inversion).

Fix: Priority Inheritance for Mutexes

Implement the Basic Priority Inheritance Protocol (PI):

  1. When a thread blocks on a mutex, donate its priority to the mutex holder
  2. When the mutex is released, drop the donations made through it and fall back to the highest remaining donation, or to the original priority if none remain
  3. Support multiple donors (priority queue of donors)
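
The donation bookkeeping can be sketched independently of the mutex itself. This model assumes that a larger number means a higher priority and identifies donors by a plain usize; PrioState and its methods are hypothetical names, and a Vec stands in for the donor priority queue:

```rust
/// Hypothetical scheduling state of a mutex holder.
pub struct PrioState {
    base_priority: usize,
    donations: Vec<(usize, usize)>, // (donor id, donated priority)
}

impl PrioState {
    pub fn new(base_priority: usize) -> Self {
        Self { base_priority, donations: Vec::new() }
    }

    /// The priority the scheduler should use: the base priority or the
    /// highest active donation, whichever is larger.
    pub fn effective_priority(&self) -> usize {
        self.donations
            .iter()
            .map(|&(_, priority)| priority)
            .fold(self.base_priority, usize::max)
    }

    /// Called when `donor` blocks on a mutex held by this thread.
    pub fn donate(&mut self, donor: usize, priority: usize) {
        match self.donations.iter_mut().find(|entry| entry.0 == donor) {
            Some(entry) => entry.1 = priority,
            None => self.donations.push((donor, priority)),
        }
    }

    /// Called when the mutex `donor` was waiting on is released.
    pub fn revoke(&mut self, donor: usize) {
        self.donations.retain(|&(id, _)| id != donor);
    }
}
```

Keeping the base priority separate from the donations means restoration never needs to remember an "old" value: the effective priority is always recomputed from what is still donated.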

Implementation Steps

  1. Add donated_priority: Option<usize> to Context
  2. Implement priority donation in mutex lock acquisition
  3. Implement priority restoration in mutex unlock
  4. Add debug assertions to detect inversion
  5. Rebuild kernel, verify boot

Execution Timeline

| Phase | Bottleneck | Duration | Dependencies |
| --- | --- | --- | --- |
| 1 | #1 CONTEXT_SWITCH_LOCK | 1-2 days | None |
| 2 | #2 Broadcast TLB shootdown | 1-2 days | Phase 1 (per-CPU flags) |
| 3 | #3 IOAPIC IRQ affinity | 1-2 days | None |
| 4 | #4 MCS locks | 2-3 days | Phase 1 (reduced contention) |
| 5 | #6 Range TLB flush | 1 day | Phase 2 (shootdown infrastructure) |
| 6 | #5 NUMA awareness | 3-5 days | Phase 4 (scheduler queues) |
| 7 | #7 Priority inheritance | 2-3 days | None |

Total estimate: 11-18 days

Patch Naming Convention

New kernel patches following this plan:

  • P9-percpu-context-switch.patch — Bottleneck #1
  • P9-broadcast-tlb-shootdown.patch — Bottleneck #2
  • P9-ioapic-irq-affinity.patch — Bottleneck #3
  • P9-mcs-locks.patch — Bottleneck #4
  • P9-range-tlb-flush.patch — Bottleneck #6
  • P9-numa-awareness.patch — Bottleneck #5
  • P9-priority-inheritance.patch — Bottleneck #7

All P9 patches must be applied after P8 patches in recipe.toml.