SMP/Scheduler Improvement Plan

Status: Active
Date: 2026-05-16
Authority: Canonical execution plan for SMP hardening
Priority order: Bottleneck #1 → #2 → #3 → #4 → #5 → #6 → #7 (the Execution Timeline schedules #6 before #5)

Context

The Red Bear OS kernel has functional SMP (multi-core) support with x2APIC, per-CPU run queues, work stealing, and DWRR+vruntime scheduling. However, several design choices limit scalability on systems with more than 2-4 cores. This plan addresses the seven identified bottlenecks in priority order.

Reference Sources

Linux 7.1: local/reference/linux-7.1/ — scheduler, TLB, IRQ affinity
seL4: local/reference/seL4/ — lock-free kernel structures, minimal context switch
Zircon: online reference only (download failed due to network issues)

Key Kernel Files

| File | Lines | Purpose |
|------|-------|---------|
| `context/switch.rs` | 577 | Scheduler, context switch, DWRR |
| `context/arch/x86_64.rs` | 395 | CONTEXT_SWITCH_LOCK, switch_to, register save/restore |
| `percpu.rs` | 205 | PercpuBlock, TLB shootdown |
| `arch/x86_shared/ipi.rs` | 53 | IPI kinds and dispatch |
| `arch/x86_shared/device/local_apic.rs` | 312 | x2APIC/xAPIC, ICR programming |
| `arch/x86_shared/device/ioapic.rs` | 476 | IOAPIC, IRQ routing, MapInfo |
| `acpi/madt/arch/x86.rs` | 354 | AP startup, SIPI, trampoline |

Red Bear Patches (already applied)

| Patch | Lines | Purpose |
|-------|-------|---------|
| P8-percpu-sched | 123 | Per-CPU scheduler queues |
| P8-percpu-wiring | 985 | Work stealing, load balancing, vruntime, affinity |
| P8-work-stealing | 190 | Steal statistics, migration helpers |
| P8-msi | 281 | MSI/MSI-X foundation, vector allocation |
| P8-msi-foundation-v2 | 188 | MSI refinement |

Bottleneck #1: Global CONTEXT_SWITCH_LOCK

Severity: 🔴 Critical — serializes ALL context switches across ALL CPUs
Files: context/arch/x86_64.rs:19, context/switch.rs:123,156,171,296

Current State

// context/arch/x86_64.rs:19
pub static CONTEXT_SWITCH_LOCK: AtomicBool = AtomicBool::new(false);

// context/switch.rs:156-162
while arch::CONTEXT_SWITCH_LOCK
    .compare_exchange_weak(false, true, Ordering::SeqCst, Ordering::Relaxed)
    .is_err()
{
    hint::spin_loop();
    percpu.maybe_handle_tlb_shootdown();
}

Every context switch on ANY CPU must acquire this single global lock. On an 8-core system, up to 7 CPUs can be spin-waiting while one CPU performs its context switch. This is the primary SMP scalability limiter.

Why It Exists

The comment says: "Acquire the global lock to ensure exclusive access during context switch and avoid issues that would be caused by the unsafe operations below."

The concern is that during switch_to, the CPU is in a transitional state:

1. Prev context's registers are saved
2. Next context's registers are being loaded
3. Stack pointer changes to next context's stack
4. switch_finish_hook runs to drop guards and release lock

During steps 1-4, another CPU switching to the same context could cause data races.

Analysis: The Lock Is Overly Conservative

The per-context write locks (Arc<ContextLock>) already prevent concurrent access to the same context. The switch() function:

Locks prev context (write) — prevents anyone else from modifying it
Locks next context (write) — prevents anyone else from modifying it
Updates running flags and CPU IDs (under both write locks)
Stores both guards in switch_result (kept alive until switch_finish_hook)
Calls arch::switch_to (register swap)

If CPU 0 holds write locks on contexts A and B, CPU 1 cannot lock either A or B. The global lock adds nothing for correctness — it only serializes independent switches involving completely different contexts.

Fix: Per-CPU Flag in PercpuBlock

Replace the global AtomicBool with a per-CPU flag in ContextSwitchPercpu:

// context/switch.rs: add to ContextSwitchPercpu (embedded in PercpuBlock)
pub in_context_switch: AtomicBool,

The per-CPU flag:

Each CPU acquires its own flag before switching — zero cross-CPU contention
Debug assertion catches re-entrant switches on the same CPU
Released in switch_finish_hook as before
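
The flag protocol described above can be sketched in isolation. This is a minimal illustration, not the kernel's code: the struct below is a stand-in for the real ContextSwitchPercpu, and the method names begin_switch and finish_switch are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the kernel's per-CPU switch state.
pub struct ContextSwitchPercpu {
    /// True while this CPU is between the start of a switch and its finish hook.
    pub in_context_switch: AtomicBool,
}

impl ContextSwitchPercpu {
    pub const fn new() -> Self {
        Self { in_context_switch: AtomicBool::new(false) }
    }

    /// Marks this CPU as switching. Returns false if the flag was already set,
    /// which would indicate a re-entrant switch on the same CPU.
    pub fn begin_switch(&self) -> bool {
        // `swap` returns the previous value. No retry loop is needed because
        // no other CPU ever writes this CPU's flag.
        !self.in_context_switch.swap(true, Ordering::Acquire)
    }

    /// Clears the flag (hypothetical name; would be called from the finish hook).
    pub fn finish_switch(&self) {
        self.in_context_switch.store(false, Ordering::Release);
    }
}
```

In a debug build the switch path could assert that begin_switch() returned true; two different CPUs never touch the same flag, so neither waits on the other.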

Implementation Steps

Add in_context_switch: AtomicBool to ContextSwitchPercpu in switch.rs
Remove CONTEXT_SWITCH_LOCK from context/arch/x86_64.rs (and aarch64, riscv64)
Replace arch::CONTEXT_SWITCH_LOCK.compare_exchange_weak(...) with per-CPU flag check
Replace arch::CONTEXT_SWITCH_LOCK.store(false, ...) with per-CPU flag release
Update switch_finish_hook accordingly
Rebuild kernel, verify boot

Risk Assessment

Low risk: The per-CPU flag is structurally equivalent to the global lock for each CPU. The global lock's only effect was preventing concurrent switches on different CPUs, which is unnecessary given per-context write locks.
Safety net: Keep the per-CPU flag as a debug assertion. If re-entrant switching is detected, panic instead of corrupting state.

Bottleneck #2: No Broadcast TLB Shootdown

Severity: 🔴 Critical — O(N) shootdown on N CPUs, each with individual IPI
Files: percpu.rs:75-113, ipi.rs:22-38

Current State

// percpu.rs:106-112 — shootdown_tlb_ipi when target is None (broadcast)
for id in 0..crate::cpu_count() {
    // TODO: Optimize: use global counter and percpu ack counters, send IPI using
    // destination shorthand "all CPUs".
    shootdown_tlb_ipi(Some(LogicalCpuId::new(id)));
}

Broadcast TLB shootdown is implemented as a loop, sending individual IPIs to each CPU. Each IPI requires:

Set wants_tlb_shootdown flag on target CPU
Spin-wait if previous shootdown is still pending
Send IPI via ipi_single()
Target CPU processes IPI in interrupt handler

On a 128-core system, this means 127 individual IPI sends, each with spin-wait overhead.

Fix: x2APIC Destination Shorthand

The Local APIC supports destination shorthands in the ICR:

01b = "self" (Current)
10b = "all including self" (All)
11b = "all except self" (Other)

The IpiTarget enum already defines these values (ipi.rs:15-19), and the ipi() function (ipi.rs:22-38) already supports them. We just need to use IpiTarget::Other for broadcast TLB shootdowns.

Implementation Steps

Add tlb_shootdown_pending: AtomicU32 ACK counter to PercpuBlock
Add global TLB_SHOOTDOWN_GENERATION: AtomicU32 counter
In shootdown_tlb_ipi(None):
- Increment generation counter
- Set wants_tlb_shootdown on all CPUs (lock-free)
- Send single IPI with IpiTarget::Other shorthand
In maybe_handle_tlb_shootdown():
- Process shootdown
- Increment ACK counter
Add wait_for_tlb_acknowledgments() with timeout
Rebuild kernel, verify boot
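
The steps above can be sketched as a small state machine. This is a simplified model under stated assumptions: all type and method names are hypothetical, the ACK count is kept in one shared counter rather than per CPU, only one broadcast is assumed to be in flight at a time, and the IPI send and the actual TLB invalidation are left as comments.

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

/// Hypothetical slice of the per-CPU block that matters for shootdowns.
pub struct PercpuTlb {
    pub wants_tlb_shootdown: AtomicBool,
}

impl PercpuTlb {
    pub const fn new() -> Self {
        Self { wants_tlb_shootdown: AtomicBool::new(false) }
    }
}

/// Hypothetical shared state: a generation counter plus outstanding ACKs.
pub struct ShootdownState {
    generation: AtomicU32,
    pending_acks: AtomicU32,
}

impl ShootdownState {
    pub const fn new() -> Self {
        Self { generation: AtomicU32::new(0), pending_acks: AtomicU32::new(0) }
    }

    /// Initiator: bump the generation, publish the expected ACK count, then
    /// flag every other CPU. Returns the new generation.
    pub fn begin_broadcast(&self, cpus: &[PercpuTlb], self_id: usize) -> u32 {
        let generation = self.generation.fetch_add(1, Ordering::AcqRel).wrapping_add(1);
        let others = cpus.iter().enumerate().filter(|(id, _)| *id != self_id).count() as u32;
        // Publish the count before any flag so a fast target never underflows it.
        self.pending_acks.store(others, Ordering::Release);
        for (id, cpu) in cpus.iter().enumerate() {
            if id != self_id {
                cpu.wants_tlb_shootdown.store(true, Ordering::Release);
            }
        }
        // Kernel code would send the single "all except self" IPI here.
        generation
    }

    /// Target: if a shootdown was requested, perform it and acknowledge.
    pub fn handle(&self, cpu: &PercpuTlb) -> bool {
        if !cpu.wants_tlb_shootdown.swap(false, Ordering::AcqRel) {
            return false;
        }
        // The TLB invalidation itself would happen here.
        self.pending_acks.fetch_sub(1, Ordering::AcqRel);
        true
    }

    /// Initiator: spin until every ACK arrived or the spin budget is used up.
    pub fn wait_for_acks(&self, max_spins: u32) -> bool {
        for _ in 0..max_spins {
            if self.pending_acks.load(Ordering::Acquire) == 0 {
                return true;
            }
            spin_loop();
        }
        self.pending_acks.load(Ordering::Acquire) == 0
    }
}
```

The spin budget stands in for the timeout mentioned in the steps; a real implementation would also need to decide what to do when the wait gives up.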

x2APIC ICR Format

For x2APIC, the ICR is a single 64-bit MSR write:

Bits 63:32 = Destination APIC ID (ignored for shorthands)
Bits 19:18 = Destination Shorthand (0=none, 1=self, 2=all, 3=all-except-self)
Bit  15    = Trigger Mode (0=edge, 1=level)
Bit  14    = Level (0=de-assert, 1=assert)
Bit  11    = Destination Mode (0=physical, 1=logical)
Bits 10:8  = Delivery Mode (0=fixed)
Bits 7:0   = Vector

For "all except self" broadcast with TLB vector (0x41):

let icr = (3u64 << 18) | (1u64 << 14) | (IpiKind::Tlb as u64);
// = 0x0000_0000_000C_4041 (shorthand 0xC_0000 | level-assert 0x4000 | vector 0x41)
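
As a cross-check of the encoding, the same value can be computed by a small standalone helper. The enum and function names here are illustrative only and are not the kernel's IpiTarget or ipi() definitions; the vector 0x41 is the TLB vector quoted above.

```rust
/// Destination shorthand values for ICR bits 19:18 (illustrative names).
#[derive(Clone, Copy)]
#[repr(u64)]
pub enum Shorthand {
    None = 0,
    SelfOnly = 1,
    AllIncludingSelf = 2,
    AllExcludingSelf = 3,
}

/// TLB shootdown vector as quoted in this plan.
pub const TLB_VECTOR: u8 = 0x41;

/// Builds an ICR value for a fixed-delivery IPI with the level bit asserted.
pub const fn encode_icr(shorthand: Shorthand, vector: u8) -> u64 {
    const LEVEL_ASSERT: u64 = 1 << 14;
    ((shorthand as u64) << 18) | LEVEL_ASSERT | (vector as u64)
}
```

For the all-except-self TLB broadcast this gives 0xC_4041: everything lives in the low 32 bits, and the destination field in bits 63:32 stays zero because it is ignored when a shorthand is used.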

Bottleneck #3: IRQ Affinity Not Wired to IOAPIC

Severity: 🟡 Medium — stored but never applied to hardware
Files: ioapic.rs, MSI patches P8-msi.patch

Current State

The IOAPIC MapInfo struct has a dest: ApicId field, and the DestinationMode enum has a Logical variant. However:

No set_affinity() function — there's no way to reprogram an IOAPIC redirection entry to change its destination APIC
Legacy IRQs all route to BSP — init() hardcodes bsp_apic_id as destination
MSI patches store affinity — P8-msi.patch adds set_irq_affinity() API but doesn't reprogram IOAPIC hardware

Fix: Add IOAPIC IRQ Affinity

Add a function to reprogram the IOAPIC redirection table entry:

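impl IoApic {
    /// Retargets the redirection entry for `gsi` at local APIC `dest`.
    /// Returns false if `gsi` is outside this IOAPIC or `dest` cannot be encoded.
    pub fn set_irq_affinity(&self, gsi: u32, dest: ApicId) -> bool {
        // Reject GSIs below gsi_start or too large for a u8 index instead of wrapping.
        let Some(idx) = gsi
            .checked_sub(self.gsi_start)
            .and_then(|i| u8::try_from(i).ok())
        else {
            return false;
        };
        // The redirection entry's destination field is 8 bits wide (bits 63:56).
        let Ok(dest_id) = u8::try_from(dest.get()) else {
            return false;
        };
        let mut guard = self.regs.lock();
        let Some(mut entry) = guard.read_ioredtbl(idx) else {
            return false;
        };
        entry &= !(0xFF_u64 << 56);
        entry |= u64::from(dest_id) << 56;
        guard.write_ioredtbl(idx, entry)
    }
}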

Add a public API to find the right IOAPIC and call it:

pub fn set_affinity(irq: u8, dest: ApicId) {
    let gsi = resolve(irq);
    if let Some(apic) = find_ioapic(gsi) {
        apic.set_irq_affinity(gsi, dest);
    }
}

Implementation Steps

Add IoApic::set_irq_affinity() method
Add ioapic::set_affinity() public function
Wire into kernel IRQ scheme set_affinity handler
Add round-robin or NUMA-aware default affinity for new IRQs
Rebuild kernel, verify boot
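
The bit manipulation and the round-robin default can be checked outside the kernel with two pure functions. Both names are hypothetical, and the sketch assumes the 8-bit destination field in bits 63:56 of a redirection entry; APIC IDs that do not fit in 8 bits are rejected rather than truncated.

```rust
/// Position and mask of the 8-bit destination field in a redirection entry.
const DEST_SHIFT: u32 = 56;
const DEST_MASK: u64 = 0xFF_u64 << DEST_SHIFT;

/// Returns `entry` with its destination field replaced, leaving all other bits
/// untouched. Returns None if the APIC ID does not fit in 8 bits.
pub fn with_destination(entry: u64, dest_apic_id: u32) -> Option<u64> {
    let dest = u8::try_from(dest_apic_id).ok()?;
    Some((entry & !DEST_MASK) | (u64::from(dest) << DEST_SHIFT))
}

/// Picks a default destination for the n-th IRQ by cycling through the
/// available APIC IDs. Returns None if no CPU is available.
pub fn round_robin_dest(irq_index: usize, apic_ids: &[u32]) -> Option<u32> {
    if apic_ids.is_empty() {
        None
    } else {
        Some(apic_ids[irq_index % apic_ids.len()])
    }
}
```

A NUMA-aware policy could replace round_robin_dest later without touching the entry encoding.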

Bottleneck #4: Simple Spinlocks for Scheduler Queues

Severity: 🟡 Medium — unfair under contention
Files: context/switch.rs (run_contexts access)

Current State

Per-CPU run queues use spin::Mutex (simple spinlock). Under contention:

No fairness guarantee — a CPU may spin indefinitely
No backoff — constant cache line bouncing
No NUMA awareness — cross-socket contention is expensive

Fix: MCS Lock or Try-Lock with Backoff

Replace spin::Mutex with an MCS lock (John Mellor-Crummey and Michael Scott):

Each waiter spins on a local flag (cache-line friendly)
FIFO ordering guarantees fairness
O(1) cache line transfers on unlock

Alternatively, since per-CPU queues should have low contention:

Use try_lock() with exponential backoff
Fall back to global queue if per-CPU queue is contended

Implementation Steps

Implement MCS lock primitive in sync/
Replace spin::Mutex in run queue access
Add contention statistics to PercpuBlock
Rebuild kernel, verify boot
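
For orientation, a textbook MCS lock fits in a few dozen lines. The sketch below uses std atomics so it can be tried in userspace; it is not the proposed sync/ primitive, the type names are hypothetical, and a kernel version would additionally need to deal with interrupts and preemption while a node is queued.

```rust
use std::hint::spin_loop;
use std::ptr;
use std::sync::atomic::{AtomicBool, AtomicPtr, Ordering};

/// Queue node owned by one waiter. Each waiter spins only on its own `locked`.
pub struct McsNode {
    next: AtomicPtr<McsNode>,
    locked: AtomicBool,
}

impl McsNode {
    pub const fn new() -> Self {
        Self { next: AtomicPtr::new(ptr::null_mut()), locked: AtomicBool::new(false) }
    }
}

pub struct McsLock {
    tail: AtomicPtr<McsNode>,
}

impl McsLock {
    pub const fn new() -> Self {
        Self { tail: AtomicPtr::new(ptr::null_mut()) }
    }

    /// Acquires the lock. `node` must stay alive and in place until the
    /// matching `unlock` with the same node has returned.
    pub fn lock(&self, node: &McsNode) {
        node.next.store(ptr::null_mut(), Ordering::Relaxed);
        node.locked.store(true, Ordering::Relaxed);
        let me = node as *const McsNode as *mut McsNode;
        // Enqueue at the tail; the previous tail (if any) is our predecessor.
        let prev = self.tail.swap(me, Ordering::AcqRel);
        if !prev.is_null() {
            // SAFETY: the predecessor cannot finish `unlock` before it has
            // seen this link, so its node is still alive here.
            unsafe { (*prev).next.store(me, Ordering::Release) };
            while node.locked.load(Ordering::Acquire) {
                spin_loop();
            }
        }
    }

    /// Releases the lock and hands it to the next queued waiter, if any.
    pub fn unlock(&self, node: &McsNode) {
        let me = node as *const McsNode as *mut McsNode;
        let mut next = node.next.load(Ordering::Acquire);
        if next.is_null() {
            // No visible successor: try to reset the queue to empty.
            if self
                .tail
                .compare_exchange(me, ptr::null_mut(), Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
            // A successor has swapped the tail but not linked itself yet.
            loop {
                next = node.next.load(Ordering::Acquire);
                if !next.is_null() {
                    break;
                }
                spin_loop();
            }
        }
        // SAFETY: the successor is still spinning in `lock`, so its node is alive.
        unsafe { (*next).locked.store(false, Ordering::Release) };
    }
}
```

Because each waiter spins on a flag in its own node, a release touches only the successor's cache line, and the queue order gives FIFO fairness.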

Bottleneck #5: No NUMA Topology Awareness

Severity: 🟡 Medium — treats all CPUs and memory as uniform
Files: acpi/madt/mod.rs, percpu.rs

Current State

No SRAT parsing (NUMA proximity domains)
No SLIT parsing (NUMA distance matrix)
Work stealing is random — may steal from a remote socket
Memory allocation is uniform — no preference for local node

Fix: ACPI SRAT + SLIT Parsing

Parse SRAT (System Resource Affinity Table) for CPU-to-node mapping
Parse SLIT (System Locality Information Table) for distance matrix
Add numa_node: u32 to PercpuBlock
Prefer stealing from same-socket CPUs
Prefer allocating memory from local node

Implementation Steps

Add SRAT/SLIT table parsing in acpi/
Extend PercpuBlock with NUMA info
Update work stealing to prefer local node
Update memory allocator with NUMA hints
Rebuild kernel, verify boot
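
Once SRAT and SLIT data are available, the stealing preference reduces to sorting candidate victims by node distance. The function below is a sketch with hypothetical names and plain slices standing in for the parsed tables; the example distances follow the ACPI convention that a node's distance to itself is 10.

```rust
/// Orders candidate victim CPUs for work stealing by SLIT distance from `thief`.
///
/// `cpu_node[c]` is the NUMA node of CPU `c` (from SRAT) and
/// `slit[a][b]` is the relative distance from node `a` to node `b` (from SLIT).
pub fn steal_order(thief: usize, cpu_node: &[usize], slit: &[Vec<u8>]) -> Vec<usize> {
    let my_node = cpu_node[thief];
    let mut victims: Vec<usize> = (0..cpu_node.len()).filter(|&c| c != thief).collect();
    // The sort is stable, so equally distant victims stay in CPU-id order.
    victims.sort_by_key(|&c| slit[my_node][cpu_node[c]]);
    victims
}
```

With CPUs 0-1 on node 0 and CPUs 2-3 on node 1, CPU 2 would try CPU 3 first and only then the remote CPUs 0 and 1. A kernel version would precompute this order per CPU rather than allocate on every steal attempt.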

Bottleneck #6: Coarse TLB Flush

Severity: 🟡 Low-Medium — full TLB flush instead of range-based
Files: percpu.rs:122

Current State

// percpu.rs:122
crate::memory::RmmA::invalidate_all();

Every TLB shootdown flushes the entire TLB, even when only a single page changed. A full flush discards translations that are still valid, so the CPU must re-walk the page tables for its working set afterwards, and that cost grows with TLB size.

Fix: Range-Based and Single-Page Invalidation

Use x86 INVLPG for single-page invalidation:

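// For a single page (INVLPG on its virtual address):
unsafe { x86::tlb::flush(addr) };

// For a range: per-page INVLPG, with a full flush for large ranges (> 32 pages)
if page_count > 32 {
    unsafe { x86::tlb::flush_all() };
} else {
    for addr in (start..start + page_count * PAGE_SIZE).step_by(PAGE_SIZE) {
        unsafe { x86::tlb::flush(addr) };
    }
}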

Implementation Steps

Add shootdown_range(start: Page, count: usize) to percpu
Store shootdown range in PercpuBlock alongside flag
Replace invalidate_all() with conditional INVLPG
Fall back to full flush for large ranges (> 32 pages) or PCID flush
Rebuild kernel, verify boot
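
The choice between per-page and full invalidation can be isolated in a pure function, which keeps the threshold easy to test. The names below are hypothetical, a 4 KiB page size is assumed, and the Vec is only there to make the result inspectable; kernel code would iterate over the range instead of allocating.

```rust
pub const PAGE_SIZE: usize = 4096;
/// Ranges larger than this many pages fall back to a full flush.
pub const FULL_FLUSH_THRESHOLD: usize = 32;

#[derive(Debug, PartialEq, Eq)]
pub enum FlushPlan {
    /// Nothing to invalidate.
    Nothing,
    /// Invalidate these page-aligned virtual addresses one by one.
    Pages(Vec<usize>),
    /// Invalidate the whole TLB.
    Full,
}

/// Decides how to invalidate `page_count` pages starting at `start`.
pub fn plan_flush(start: usize, page_count: usize) -> FlushPlan {
    match page_count {
        0 => FlushPlan::Nothing,
        n if n > FULL_FLUSH_THRESHOLD => FlushPlan::Full,
        n => {
            // Align down so every address names the start of a page.
            let base = start & !(PAGE_SIZE - 1);
            FlushPlan::Pages((0..n).map(|i| base + i * PAGE_SIZE).collect())
        }
    }
}
```

The 32-page cutoff is the value proposed in this plan, not a measured optimum, and would likely need tuning per CPU generation.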

Bottleneck #7: No Priority Inheritance

Severity: 🟡 Low — mutex priority inversion possible
Files: sync/ (various lock primitives)

Current State

There is no priority inheritance protocol. If a low-priority thread holding a mutex is preempted by medium-priority threads, a high-priority thread waiting on the same mutex stays blocked for as long as those threads keep running (priority inversion), which is unbounded in the worst case.

Fix: Priority Inheritance for Mutexes

Implement the Basic Priority Inheritance Protocol (PI):

When a thread blocks on a mutex, donate its priority to the mutex holder
When the mutex is released, restore the original priority
Support multiple donors (priority queue of donors)

Implementation Steps

Add donated_priority: Option<usize> to Context
Implement priority donation in mutex lock acquisition
Implement priority restoration in mutex unlock
Add debug assertions to detect inversion
Rebuild kernel, verify boot
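
The bookkeeping for multiple donors can be sketched independently of any lock implementation. This is an illustration only: the type and method names are hypothetical, and it assumes that a larger number means a higher priority, which may not match the scheduler's actual convention.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-context priority state with donation tracking.
/// Assumption: larger numbers mean higher priority.
pub struct PrioState {
    base: usize,
    /// Donated priority mapped to the number of waiters donating it.
    donors: BTreeMap<usize, usize>,
}

impl PrioState {
    pub fn new(base: usize) -> Self {
        Self { base, donors: BTreeMap::new() }
    }

    /// Called when a waiter with priority `prio` blocks on a mutex we hold.
    pub fn donate(&mut self, prio: usize) {
        *self.donors.entry(prio).or_insert(0) += 1;
    }

    /// Called when that waiter stops waiting, or when we release the mutex.
    pub fn revoke(&mut self, prio: usize) {
        let emptied = match self.donors.get_mut(&prio) {
            Some(count) => {
                *count -= 1;
                *count == 0
            }
            None => false,
        };
        if emptied {
            self.donors.remove(&prio);
        }
    }

    /// Priority the scheduler should use: the base or the highest donation.
    pub fn effective(&self) -> usize {
        self.donors
            .keys()
            .next_back()
            .map_or(self.base, |&donated| donated.max(self.base))
    }
}
```

Keeping the donors in an ordered map means that revoking the highest donation falls back to the next-highest one instead of straight to the base priority.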

Execution Timeline

| Phase | Bottleneck | Duration | Dependencies |
|-------|------------|----------|--------------|
| 1 | #1 CONTEXT_SWITCH_LOCK | 1-2 days | None |
| 2 | #2 Broadcast TLB shootdown | 1-2 days | Phase 1 (per-CPU flags) |
| 3 | #3 IOAPIC IRQ affinity | 1-2 days | None |
| 4 | #4 MCS locks | 2-3 days | Phase 1 (reduced contention) |
| 5 | #6 Range TLB flush | 1 day | Phase 2 (shootdown infrastructure) |
| 6 | #5 NUMA awareness | 3-5 days | Phase 4 (scheduler queues) |
| 7 | #7 Priority inheritance | 2-3 days | None |

Total estimate: 11-18 days

Patch Naming Convention

New kernel patches following this plan:

P9-percpu-context-switch.patch — Bottleneck #1
P9-broadcast-tlb-shootdown.patch — Bottleneck #2
P9-ioapic-irq-affinity.patch — Bottleneck #3
P9-mcs-locks.patch — Bottleneck #4
P9-range-tlb-flush.patch — Bottleneck #6
P9-numa-awareness.patch — Bottleneck #5
P9-priority-inheritance.patch — Bottleneck #7

All P9 patches must be applied after P8 patches in recipe.toml.
