Red Bear OS SMP Boot & Scheduler Hardening Plan
Version: 1.0 — 2026-05-16
Status: Active
Canonical: This document supersedes SMP-SCHEDULER-IMPROVEMENT-PLAN.md for forward work.
Scope: Kernel SMP, AP startup, x2APIC, per-CPU data, TLB shootdowns, IRQ routing, scheduler, userspace boot, daemon robustness.
Assessment Summary
Comprehensive assessment of kernel SMP infrastructure (20 source files), userspace boot process (10 source files), and modern Intel/AMD MP specifications. Cross-referenced with Linux smpboot.c, Zircon lk_main, and seL4 multicore boot.
Total issues found: 38 kernel + 16 userspace = 54 issues
- Critical: 6 kernel + 3 userspace = 9
- High: 7 kernel + 4 userspace = 11
- Medium: 10 kernel + 5 userspace = 15
- Low: 15 kernel + 4 userspace = 19
Kernel SMP Issues
Critical (6)
| # | Issue | File | Root Cause |
|---|---|---|---|
| K1 | AP startup LogicalCpuId race | madt/arch/x86.rs:153,244,276,365 | Two APs CPU_COUNT.load(Relaxed) → same ID → both fetch_add(1) |
| K2 | AP_READY dual-mechanism sync race | madt/arch/x86.rs:174-225 | Trampoline u64 ap_ready.write(0) + static AtomicBool AP_READY: inconsistent ordering, UB on cast |
| K3 | TLB shootdown range race | percpu.rs:134-137 | Concurrent shootdowns overwrite tlb_flush_start/tlb_flush_count between flag set and IPI |
| K4 | MCS lock missing memory fences | sync/mcs.rs:74-101 | next.store() has no Release ordering, locked.load() has no Acquire ordering |
| K5 | Unbounded priority inversion chain | sync/mcs.rs:126-145 | PI donation goes one level only; transitive chains unbounded |
| K6 | Scheduler context switch flag not cleared on panic | switch.rs:164,298 | in_context_switch stays true → permanent CPU lockup |
High (7)
| # | Issue | File | Root Cause |
|---|---|---|---|
| K7 | Missing SIPI timing delays | madt/arch/x86.rs:192-337 | Spin-count delays, not TSC-based. Intel SDM requires 10ms INIT→SIPI |
| K8 | NUMA node set after CPU visible | madt/arch/x86.rs:244,253 | CPU_COUNT.fetch_add() before numa_node.set() |
| K9 | Empty memory fence before AP starts | madt/arch/x86.rs:188 | asm!("") is compiler barrier only, not hardware fence |
| K10 | TLB range Relaxed ordering | percpu.rs:146,179 | Range stores use Relaxed, no barrier before IPI send |
| K11 | IOAPIC affinity no CPU online check | ioapic.rs:126-137 | Accepts any ApicId without validation |
| K12 | MAX_CPU_COUNT=128 too small | cpu_set.rs:44 | AMD EPYC has 128C/256T, Threadripper PRO 96C/192T |
| K13 | Global IRQ count lock | scheme/irq.rs:67 | COUNTS.lock() is global spinlock on hot path |
Medium (10)
| # | Issue | File | Root Cause |
|---|---|---|---|
| K14 | x2APIC detection no fallback | local_apic.rs:56-66 | If x2APIC init fails, no fallback to xAPIC |
| K15 | AP startup timeout not time-based | madt/arch/x86.rs:44 | AP_SPIN_LIMIT=1_000_000 spin counts vary by clock speed |
| K16 | TLB shootdown no timeout | percpu.rs:134-143 | Spin waits indefinitely if target CPU crashed |
| K17 | Broadcast shootdown sequential flag-setting | percpu.rs:151-184 | O(n) flag set loop on 128+ core systems |
| K18 | PI donation write-once | sync/mcs.rs:62 | Later higher-priority waiter doesn't update |
| K19 | PI donation Relaxed ordering | sync/mcs.rs:142 | pi_donated_prio.store(Relaxed) may not be visible |
| K20 | Scheduler NUMA-unaware | switch.rs:357-495 | same_node() exists but never used in work stealing |
| K21 | IOAPIC legacy IRQs always BSP | ioapic.rs:392 | IRQs 0-15 hardcoded to BSP, no load balancing |
| K22 | RSDP no BIOS scan fallback | rsdp.rs:19-48 | Only uses bootloader-supplied address |
| K23 | No SDT checksum validation | acpi/mod.rs:94-180 | Only RSDP checksum verified, not child SDTs |
Low (15)
K24–K38: Trampoline writable+executable, fixed trampoline address 0x8000, no SIPI delivery status check, no PercpuBlock cleanup on AP failure, PercpuBlock registration race, no NUMA barrier, hardcoded preemption timer, no preemption guard enforcement, no MCS recursive detection, scheduler recursion limitation, MADT unknown types silently ignored, no MADT revision check, no SLIT diagonal validation, RSDP length bounds too loose, no APIC ESR clear before SIPI.
Userspace Boot Issues
Critical (3)
| # | Issue | File | Root Cause |
|---|---|---|---|
| U1 | Init dependency deadlock | redbear-mini.toml:244-256 | 00_intel-gpiod.service has default_dependencies=true → circular wait with driver-manager |
| U2 | No service timeout | service.rs:78-118 | Notify/Scheme types block forever if daemon hangs |
| U3 | Dependency cycle detection missing | scheduler.rs:77-95 | BFS load_units() loops forever on circular requires_weak |
High (4)
| # | Issue | File | Root Cause |
|---|---|---|---|
| U4 | No daemon restart policy | init system | Crashed daemons stay dead, no auto-restart |
| U5 | No crash cleanup | driver-manager | Spontaneous crash doesn't release scheme/PCI/IRQ |
| U6 | Boot timeline /tmp/ missing | driver-manager main.rs:24 | Writes to /tmp/... without ensuring /tmp exists |
| U7 | Hotplug redundant enumeration | hotplug.rs:31-40 | Full PCI/ACPI re-scan every 2s |
Medium (5)
U8–U12: Hotplug unbound device removal bug, ided I/O privilege expect(), serial boot markers blocking 800ms, limited parallelism (50/step), no queue overflow handling.
Low (4)
U13–U16: PCI enumeration no timeout, async enumeration no join timeout, boot status command broken if no timeline, no driver health endpoint.
Reference: Modern Hardware Requirements
Sources: Intel 64/IA-32 SDM Vol 3A Ch 8, AMD64 APM Vol 2 Ch 7, ACPI 6.5, Intel x2APIC spec, Linux smpboot.c, Zircon lk_main, seL4 multicore boot.
AP Startup Timing (Intel SDM)
- INIT deassert → SIPI: 10ms (modern CPUs: can be shorter)
- SIPI #1 → SIPI #2: 10-300µs (modern: 10µs, legacy: 300µs)
- AP response timeout: 10 seconds (Linux)
- ESR check: Clear before each SIPI, read after to verify acceptance
AP Startup Timing (AMD)
- Similar INIT/SIPI sequence
- CPUID leaf 0x8000001E for topology (ext_apic_id, core_id, node_id)
- CPUID leaf 0x1F preferred for V2 extended topology (Intel + newer AMD)
- APIC ID may exceed 255 → x2APIC mandatory
x2APIC Requirements
- Mandatory: CPU count > 255 (8-bit APIC ID exhausted)
- Detection: CPUID.01H:ECX[bit 21]
- ICR: Single 64-bit MSR write (vs two 32-bit MMIO writes)
- No delivery status bit: Hardware guarantees delivery
- Self-IPI: Dedicated MSR 0x83F (fastest single-IPI path)
ACPI MADT Entry Types (ACPI 6.5)
- Type 0: Processor Local APIC (legacy 8-bit)
- Type 1: I/O APIC
- Type 2: Interrupt Source Override
- Type 4: Local APIC NMI
- Type 5: Local APIC Address Override
- Type 9: Processor Local x2APIC (32-bit ID, required for modern hardware)
- Type 10: Local x2APIC NMI
- Type 16 (0x10): Multiprocessor Wakeup Structure (ACPI 6.4+)
Common Firmware Bugs
- Duplicate APIC IDs in MADT
- Incorrect enabled flags
- Missing entries (CPU exists but no MADT entry)
- MADT UID / DSDT _UID mismatch
- SLIT diagonal != 10 (Linux validates and rejects)
- SRAT-SLIT inconsistency
Linux Best Practices
- Parallel AP bringup (all APs kicked simultaneously) — reduces boot 500ms→100ms on 96-core
- Adaptive SIPI timing: init_udelay=0 → 10µs for modern CPUs
- 10-second timeout with schedule() yield loop
- ESR check after each SIPI, retry up to 2×
- cpu_callout_mask/cpu_callin_mask handshake
Zircon Best Practices
- Phased initialization: BSP → topology → AP release → AP init → sync
- 30-second startup timeout, OOPS (not panic) on timeout
- Idle threads pre-allocated before releasing APs
- Init levels coordinate initialization order
seL4 Best Practices
- Single atomic write releases all APs simultaneously
- Explicit cache maintenance for ARM32
- Big kernel lock for simplicity (not scalable)
- BOOT_BSS section for boot-time variables
Improvement Plan — Patch Series
Priority 0: Fix All Discovered Issues (P15)
P15-1: AP Startup LogicalCpuId Race Fix (Critical K1)
Files: src/acpi/madt/arch/x86.rs
Change: Replace CPU_COUNT.load(Relaxed) + LogicalCpuId::new(next_cpu) + CPU_COUNT.fetch_add(1) with single let cpu_id = LogicalCpuId::new(CPU_COUNT.fetch_add(1, SeqCst)). Remove separate load. Move all pre-startup setup (PercpuBlock init, NUMA node set) to between allocation and fetch_add.
Risk: Low. Standard atomic fix.
Verification: Boot with 4+ CPUs, verify all get unique IDs.
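The allocation step can be sketched as follows. This is a minimal illustration, not the kernel code: a plain u32 stands in for LogicalCpuId, the counter is passed in explicitly, and the helper names allocate_cpu_id and allocate_concurrently are hypothetical. It also assumes ID 0 already belongs to the BSP.

```rust
use std::collections::BTreeSet;
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

/// Hands out the next logical CPU ID.
/// fetch_add returns the previous value, so racing callers can never
/// observe the same ID (unlike a separate load followed by fetch_add).
pub fn allocate_cpu_id(cpu_count: &AtomicU32) -> u32 {
    cpu_count.fetch_add(1, Ordering::SeqCst)
}

/// Allocates `n` IDs from `n` threads at once and returns the distinct IDs.
pub fn allocate_concurrently(n: u32) -> BTreeSet<u32> {
    // Assumption for this sketch: the BSP already owns ID 0.
    let cpu_count = AtomicU32::new(1);
    let ids: Vec<u32> = thread::scope(|s| {
        let handles: Vec<_> = (0..n)
            .map(|_| s.spawn(|| allocate_cpu_id(&cpu_count)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    ids.into_iter().collect()
}
```

If two callers ever received the same ID, the returned set would contain fewer than `n` entries, which is the property the verification step checks on real hardware.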
P15-2: AP_READY Sync Consolidation (Critical K2)
Files: src/acpi/madt/arch/x86.rs
Change: Replace dual mechanism with single AtomicU8 at TRAMPOLINE+8. AP writes 1 when ready. BSP polls with SeqCst. Add fence(SeqCst) before/after writing trampoline args to ensure AP sees them.
Risk: Medium. Changes trampoline protocol.
Verification: Boot test on QEMU, verify all APs start correctly.
P15-3: TLB Shootdown Range Race Fix (Critical K3)
Files: src/percpu.rs
Change: Pack range into single AtomicU64 (bits [63:32] = start page, bits [31:0] = count). Single atomic swap sets flag + range atomically. Handler unpacks with single load.
Risk: Medium. Affects all TLB shootdowns.
Verification: Multi-core stress test with frequent mmap/munmap.
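A minimal sketch of the packed encoding, under stated assumptions: FlushSlot, post, and take are hypothetical names, the value 0 is assumed to mean "no request pending" (so a count of 0 is never posted), and the sending side uses compare_exchange rather than a plain swap so that a second sender cannot overwrite a pending request.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Packs a flush request: bits [63:32] = start page, bits [31:0] = page count.
pub fn pack_range(start_page: u32, count: u32) -> u64 {
    ((start_page as u64) << 32) | count as u64
}

/// Inverse of `pack_range`: returns (start_page, count).
pub fn unpack_range(packed: u64) -> (u32, u32) {
    ((packed >> 32) as u32, packed as u32)
}

/// Hypothetical per-CPU request slot. 0 is assumed to mean "empty".
pub struct FlushSlot(AtomicU64);

impl FlushSlot {
    pub const fn new() -> Self {
        Self(AtomicU64::new(0))
    }

    /// Sender side: publishes flag and range in one atomic operation.
    /// Returns false if a request is still pending, so the caller can retry.
    pub fn post(&self, start_page: u32, count: u32) -> bool {
        debug_assert!(count != 0, "count 0 is reserved for the empty slot");
        self.0
            .compare_exchange(
                0,
                pack_range(start_page, count),
                Ordering::AcqRel,
                Ordering::Acquire,
            )
            .is_ok()
    }

    /// Handler side: takes the pending request, if any, and clears the slot.
    pub fn take(&self) -> Option<(u32, u32)> {
        match self.0.swap(0, Ordering::AcqRel) {
            0 => None,
            packed => Some(unpack_range(packed)),
        }
    }
}
```

One design point to settle before implementing: a 32-bit page number covers only 16 TiB of address space with 4 KiB pages, so the real encoding would need a plan for higher addresses.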
P15-4: MCS Lock Memory Ordering (Critical K4)
Files: src/sync/mcs.rs
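Change: Add fence(Release) before next.store(new_node, Relaxed) at line 55. Add fence(Acquire) after locked.load(Relaxed) at line 59. A Release fence only takes effect with a store sequenced after it, and an Acquire fence only with a load sequenced before it; using Release and Acquire orderings directly on the store and load is an equivalent alternative. Change PI donation store to Release.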
Risk: Low. Standard lock ordering fix.
Verification: Multi-threaded contention test.
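The fence placement matters: in Rust's memory model a Release fence pairs with a later Relaxed store, and an Acquire fence pairs with an earlier Relaxed load. The sketch below shows that pattern on a two-thread handoff. It is not the MCS lock itself; the function name handoff and the data/ready variables are made up for illustration.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU32, Ordering};
use std::thread;

/// Passes `value` from a producer thread to a consumer thread using Relaxed
/// accesses plus explicit fences, and returns what the consumer observed.
pub fn handoff(value: u32) -> u32 {
    let data = AtomicU32::new(0);
    let ready = AtomicBool::new(false);
    thread::scope(|s| {
        s.spawn(|| {
            data.store(value, Ordering::Relaxed);
            // Release fence BEFORE the publishing store: everything written
            // above becomes visible to whoever observes `ready == true`
            // and then issues an Acquire fence.
            fence(Ordering::Release);
            ready.store(true, Ordering::Relaxed);
        });
        let consumer = s.spawn(|| {
            while !ready.load(Ordering::Relaxed) {
                std::hint::spin_loop();
            }
            // Acquire fence AFTER the observing load.
            fence(Ordering::Acquire);
            data.load(Ordering::Relaxed)
        });
        consumer.join().unwrap()
    })
}
```

With the fences swapped to the other side of their atomic accesses, the handoff would no longer be guaranteed to see the written value.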
P15-5: NUMA Node Before CPU Visible (High K8)
Files: src/acpi/madt/arch/x86.rs
Change: Move record_apic_mapping() and percpu.numa_node.set() BEFORE CPU_COUNT.fetch_add(). Add fence(SeqCst) between them so scheduler sees NUMA data.
Risk: Low. Reordering of operations.
Verification: Boot with QEMU SRAT, verify NUMA nodes set before scheduler sees CPUs.
P15-6: Init Dependency Deadlock Fix (Critical U1)
Files: config/redbear-mini.toml, config/redbear-full.toml
Change: Add default_dependencies = false to 00_intel-gpiod.service, 00_i2c-dw-acpi.service, 00_i2c-gpio-expanderd.service, 00_i2c-hidd.service, ucsid.service. Add explicit requires_weak for actual dependencies only.
Risk: Low. Config-only change.
Verification: Boot redbear-mini, verify all services start without deadlock.
P15-7: Service Timeout Mechanism (Critical U2)
Files: recipes/core/base/source/init/src/service.rs, recipes/core/base/source/init/src/scheduler.rs
Change: Add timeout_secs: Option<u32> to Notify and Scheme variants. Use set_read_timeout() on INIT_NOTIFY pipe. On timeout, log error and mark service failed. Boot continues.
Risk: Medium. Changes init behavior.
Verification: Create a service that never notifies, verify boot continues after timeout.
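The timeout behaviour can be sketched with a standard-library channel standing in for the INIT_NOTIFY pipe. Everything named here (StartResult, wait_ready) is hypothetical, and the sketch takes a Duration where the plan's timeout_secs would be converted with Duration::from_secs.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Outcome of waiting for a service's readiness notification.
#[derive(Debug, PartialEq, Eq)]
pub enum StartResult {
    Ready,
    TimedOut,
    Exited,
}

/// Waits for one readiness message. `None` keeps the old block-forever
/// behaviour; `Some(t)` gives up after `t` so boot can continue.
pub fn wait_ready(rx: &Receiver<()>, timeout: Option<Duration>) -> StartResult {
    match timeout {
        None => match rx.recv() {
            Ok(()) => StartResult::Ready,
            // The sending side was dropped without notifying.
            Err(_) => StartResult::Exited,
        },
        Some(t) => match rx.recv_timeout(t) {
            Ok(()) => StartResult::Ready,
            Err(RecvTimeoutError::Timeout) => StartResult::TimedOut,
            Err(RecvTimeoutError::Disconnected) => StartResult::Exited,
        },
    }
}
```

A caller would log the TimedOut case, mark the service failed, and move on to the next unit.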
P15-8: Dependency Cycle Detection (Critical U3)
Files: recipes/core/base/source/init/src/scheduler.rs
Change: Add BTreeSet<UnitId> visited tracking in load_units(). If a unit ID is already in the visited set, log cycle error and skip.
Risk: Low. Defensive programming.
Verification: Create circular dependency in test config, verify detection.
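A minimal sketch of the visited-set walk, assuming unit IDs are plain strings and dependencies live in a map (the real load_units() signature is not shown in this plan, so load_order is a hypothetical name). Note that a visited set alone cannot distinguish a true cycle from a dependency shared by two units; it only guarantees that the walk terminates and that each unit is loaded once.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Breadth-first walk over the dependency map starting at `root`.
/// Returns each reachable unit once, in the order it was first reached.
pub fn load_order<'a>(
    root: &'a str,
    deps: &BTreeMap<&'a str, Vec<&'a str>>,
) -> Vec<String> {
    let mut visited: BTreeSet<&'a str> = BTreeSet::new();
    let mut queue: VecDeque<&'a str> = VecDeque::new();
    let mut order = Vec::new();

    visited.insert(root);
    queue.push_back(root);
    while let Some(unit) = queue.pop_front() {
        order.push(unit.to_string());
        if let Some(children) = deps.get(unit) {
            for &dep in children {
                // insert() returns false if the unit was already seen,
                // which covers both cycles and shared dependencies.
                if visited.insert(dep) {
                    queue.push_back(dep);
                }
            }
        }
    }
    order
}
```

Reporting a genuine cycle, as opposed to a shared dependency, would need extra state such as a depth-first "currently on the path" set.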
P15-9: Boot Timeline /tmp/ Creation (High U6)
Files: local/recipes/system/driver-manager/source/src/main.rs
Change: Add let _ = std::fs::create_dir_all("/tmp"); at top of main(), before reset_timeline_log().
Risk: Trivial.
Verification: Boot, verify timeline file created.
P15-10: TLB Range Ordering Fix (High K10)
Files: src/percpu.rs
Change: Change tlb_flush_start/tlb_flush_count stores from Relaxed to Release. Change handler loads from Relaxed to Acquire.
Risk: Low. Ordering fix.
Verification: Multi-core TLB stress test.
Priority 1: Stabilize SMP Boot (P16)
P16-1: Calibrated SIPI Delays (High K7)
Files: src/acpi/madt/arch/x86.rs
Change: Implement udelay(us) using TSC (calibrated during early boot). Replace spin-count delays: 10ms INIT→SIPI, 10µs SIPI→SIPI for modern CPUs.
Reference: Linux wakeup_secondary_cpu_via_init(), Intel SDM Vol 3A §8.4.
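The delay arithmetic can be sketched independently of the hardware. The functions below are hypothetical helpers: the counter is passed in as a closure (in the kernel it would read the TSC) and the calibrated frequency is given in kHz.

```rust
/// Converts microseconds to counter ticks for a counter running at `freq_khz`.
pub fn us_to_ticks(us: u64, freq_khz: u64) -> u64 {
    us.saturating_mul(freq_khz) / 1000
}

/// Busy-waits until `read_counter` has advanced by at least `ticks`.
/// Returns the number of ticks that actually elapsed.
pub fn spin_ticks(mut read_counter: impl FnMut() -> u64, ticks: u64) -> u64 {
    let start = read_counter();
    loop {
        // wrapping_sub keeps the comparison correct if the counter wraps.
        let elapsed = read_counter().wrapping_sub(start);
        if elapsed >= ticks {
            return elapsed;
        }
        std::hint::spin_loop();
    }
}

/// Delay of `us` microseconds on a counter calibrated to `freq_khz`.
pub fn udelay(read_counter: impl FnMut() -> u64, freq_khz: u64, us: u64) -> u64 {
    spin_ticks(read_counter, us_to_ticks(us, freq_khz))
}
```

On a 3 GHz counter, the 10ms INIT→SIPI wait works out to 30,000,000 ticks and the 10µs SIPI→SIPI wait to 30,000 ticks, regardless of how fast the spin loop itself runs.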
P16-2: AP Startup Error Status Check (Medium)
Files: src/acpi/madt/arch/x86.rs
Change: After each SIPI, clear and read APIC ESR. If delivery error, retry. Log failure for each CPU. Continue boot with available CPUs.
Reference: Linux checks send_status + accept_status.
P16-3: MAX_CPU_COUNT Increase (High K12)
Files: src/cpu_set.rs
Change: Increase MAX_CPU_COUNT from 128 to 256. Add boot-time warning if CPUs approach limit.
P16-4: AP Startup Graceful Degradation
Files: src/acpi/madt/arch/x86.rs
Change: If AP fails trampoline or AP_READY timeout, log warning, skip CPU, continue boot. Track cpu_online mask separately from cpu_possible.
P16-5: Firmware Bug Detection
Files: src/acpi/madt/mod.rs, src/acpi/mod.rs
Change: Add duplicate APIC ID detection during MADT parsing. Add SDT checksum validation (sum all bytes == 0). Log warnings for unknown MADT entry types. Cross-reference MADT entries with SRAT for consistency.
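Two of these checks are small enough to sketch directly. The function names are hypothetical; the checksum rule (all bytes of the table sum to 0 modulo 256) is the one stated above.

```rust
use std::collections::BTreeSet;

/// Returns true if all bytes of the table sum to 0 modulo 256.
pub fn sdt_checksum_ok(table: &[u8]) -> bool {
    table.iter().fold(0u8, |sum, &b| sum.wrapping_add(b)) == 0
}

/// Returns every APIC ID that appears more than once, in ascending order.
pub fn duplicate_apic_ids(ids: &[u32]) -> Vec<u32> {
    let mut seen = BTreeSet::new();
    let mut dups = BTreeSet::new();
    for &id in ids {
        // insert() returns false when the ID was already present.
        if !seen.insert(id) {
            dups.insert(id);
        }
    }
    dups.into_iter().collect()
}
```

Both functions only report; whether a bad table is skipped or merely logged is left to the caller.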
Priority 2: Desktop-Safe Scheduler (P17)
P17-1: NUMA-Aware Work Stealing (Medium K20)
Files: src/context/switch.rs
Change: In select_next_context(), prefer contexts on same NUMA node. Use numa::topology().same_node(). Apply penalty for cross-node stealing. Use SLIT distance matrix for weight.
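One possible shape for the victim choice is sketched below. It is an illustration only: Candidate and pick_victim are hypothetical, the distance table is a SLIT-style matrix where 10 means local, and "closest node first, then longest run queue" is an assumed weighting rather than the planned one.

```rust
use std::cmp::Reverse;

/// A CPU that work could be stolen from.
#[derive(Debug, Clone, Copy)]
pub struct Candidate {
    pub cpu: usize,
    pub node: usize,
    pub runnable: usize,
}

/// Picks the CPU to steal from: only CPUs with more than one runnable
/// context qualify; among those, the nearest NUMA node wins and ties are
/// broken by the longest run queue.
pub fn pick_victim(
    thief_node: usize,
    candidates: &[Candidate],
    distance: &[Vec<u8>],
) -> Option<usize> {
    candidates
        .iter()
        .filter(|c| c.runnable > 1)
        .min_by_key(|c| (distance[thief_node][c.node], Reverse(c.runnable)))
        .map(|c| c.cpu)
}
```

With this ordering a remote CPU is chosen only when no local CPU has spare work, which is the cross-node penalty in its simplest form.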
P17-2: Transitive Priority Inheritance (Critical K5)
Files: src/sync/mcs.rs
Change: When donating priority to lock holder, check if holder is waiting on another MCS lock. Propagate donation transitively up to 4 levels deep (bounded). Add lock graph cycle detection.
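The bounded propagation can be sketched on a toy model. Task, Lock, and donate are hypothetical types, tasks and locks are identified by index, and a larger number is assumed to mean higher priority. The sketch stops when the chain leads back to the original waiter; full lock-graph cycle detection is not shown.

```rust
/// Maximum number of lock holders a single donation may walk through.
pub const MAX_PI_DEPTH: usize = 4;

pub struct Task {
    pub base_prio: u8,
    pub donated_prio: u8,
    /// Index of the lock this task is blocked on, if any.
    pub waiting_on: Option<usize>,
}

pub struct Lock {
    /// Index of the task currently holding the lock, if any.
    pub holder: Option<usize>,
}

/// Donates the waiter's effective priority along the chain of lock holders,
/// at most MAX_PI_DEPTH levels deep. Returns how many holders were raised.
pub fn donate(tasks: &mut [Task], locks: &[Lock], waiter: usize) -> usize {
    let prio = tasks[waiter].base_prio.max(tasks[waiter].donated_prio);
    let mut current = waiter;
    let mut raised = 0;
    for _ in 0..MAX_PI_DEPTH {
        let Some(lock) = tasks[current].waiting_on else { break };
        let Some(holder) = locks[lock].holder else { break };
        if holder == waiter {
            break; // chain loops back to the waiter: stop rather than spin
        }
        if tasks[holder].donated_prio < prio {
            tasks[holder].donated_prio = prio;
            raised += 1;
        }
        current = holder;
    }
    raised
}
```

The depth bound also guarantees termination on any cycle that does not pass through the waiter, at the cost of leaving chains longer than four levels only partially boosted.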
P17-3: CPU Affinity (New Feature)
Files: src/context/context.rs, src/context/switch.rs
Change: Add affinity: LogicalCpuSet to Context. Scheduler respects mask. Default: all CPUs. Add sched_setaffinity syscall.
P17-4: Preemption Latency Bounds
Files: src/context/switch.rs
Change: Replace hardcoded new_ticks >= 3 with configurable interval. Enforce preempt_locks > 0 guard at context switch. Add preemption-safe lock wrappers.
P17-5: Load Balancing
Files: src/context/switch.rs
Change: Periodic (every 100ms) load rebalancing. Migrate tasks from CPUs with >2 runnable to idle CPUs. Use NUMA distance for cross-node decisions.
Priority 3: Harden IPC & Scheme Servers (P18)
P18-1: Daemon Restart Policy (High U4)
Files: recipes/core/base/source/init/src/service.rs, recipes/core/base/source/init/src/scheduler.rs
Change: Add restart = "on-failure" | "always" | "never" to service config. Implement exponential backoff: 1s → 2s → 4s → 8s → 30s max. Track restart count, give up after 5 consecutive failures.
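A sketch of the delay calculation, assuming the delay doubles after each consecutive failure and is capped at 30s. The function name and the max_failures parameter are hypothetical; with the give-up limit of 5 the restarts wait 1s, 2s, 4s, and 8s, and the cap only comes into play if the limit is raised.

```rust
use std::time::Duration;

/// Upper bound on the wait between restarts.
pub const MAX_DELAY_SECS: u64 = 30;

/// Returns how long to wait before restarting a service that has now failed
/// `consecutive_failures` times in a row, or None if nothing has failed yet
/// or init should give up.
pub fn restart_delay(consecutive_failures: u32, max_failures: u32) -> Option<Duration> {
    if consecutive_failures == 0 || consecutive_failures >= max_failures {
        return None;
    }
    // 1s after the first failure, doubling each time, capped at MAX_DELAY_SECS.
    let secs = 1u64
        .checked_shl(consecutive_failures - 1)
        .unwrap_or(u64::MAX)
        .min(MAX_DELAY_SECS);
    Some(Duration::from_secs(secs))
}
```

The counter would be reset to zero once a restarted service has stayed up, so that an occasional crash does not accumulate toward the give-up limit.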
P18-2: Process Monitoring & Cleanup (High U5)
Files: local/recipes/system/driver-manager/source/src/config.rs
Change: Non-blocking waitpid(WNOHANG) poll in hotplug loop. On driver exit: release scheme, unbind PCI device, free IRQ. Notify init of failure.
P18-3: Bounded Scheme Request Queues (Medium)
Change: Add configurable queue depth limit to scheme daemons. When full, return EBUSY. Prevents memory exhaustion.
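The queue limit can be sketched as a thin wrapper around a deque. BoundedQueue and QueueError are hypothetical names, and QueueError::Busy stands in for returning EBUSY to the caller.

```rust
use std::collections::VecDeque;

/// Error returned when the queue is at its configured depth (maps to EBUSY).
#[derive(Debug, PartialEq, Eq)]
pub enum QueueError {
    Busy,
}

/// FIFO request queue that refuses new entries once `depth` is reached.
pub struct BoundedQueue<T> {
    items: VecDeque<T>,
    depth: usize,
}

impl<T> BoundedQueue<T> {
    pub fn new(depth: usize) -> Self {
        Self { items: VecDeque::with_capacity(depth), depth }
    }

    /// Enqueues a request, or rejects it without allocating if the queue is full.
    pub fn push(&mut self, request: T) -> Result<(), QueueError> {
        if self.items.len() >= self.depth {
            return Err(QueueError::Busy);
        }
        self.items.push_back(request);
        Ok(())
    }

    /// Dequeues the oldest pending request.
    pub fn pop(&mut self) -> Option<T> {
        self.items.pop_front()
    }
}
```

Rejecting at push time keeps memory use bounded by the configured depth and pushes the retry decision back to the client.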
P18-4: Watchdog/Health Monitoring (High)
Change: Optional health-check ping in scheme protocol. Init checks critical services every 5s. On failure, restart per restart policy.
Priority 4: Stress-Test Userspace Drivers (P19)
P19-1: Multi-Core Driver Stress Test
Change: Parallel I/O to ided/ahcid/nvmed + network e1000d + input evdevd. Verify no panics, no hangs, no data corruption over 1 hour.
P19-2: GPU Parallel Submission
Change: Multiple processes submit to virtio-gpu / redox-drm simultaneously. Verify fencing correctness, no GPU hang.
P19-3: USB Hotplug Under Load
Change: Rapid device connect/disconnect while transferring data via usbscsid. Verify cleanup and no resource leaks.
P19-4: Hotplug Stress Test
Change: QEMU virtio device hot-add/hot-remove while system under load. Verify driver-manager handles changes correctly.
Estimated Effort
| Priority | Patches | Lines | Time |
|---|---|---|---|
| P0 (Fix discovered) | P15-1 through P15-10 | ~800 | 2-3 days |
| P1 (SMP stabilize) | P16-1 through P16-5 | ~500 | 2-3 days |
| P2 (Scheduler) | P17-1 through P17-5 | ~1200 | 5-7 days |
| P3 (IPC harden) | P18-1 through P18-4 | ~800 | 3-5 days |
| P4 (Stress test) | P19-1 through P19-4 | ~600 | 2-3 days |
| Total | 28 patches | ~3900 | 14-21 days |
Acceptance Criteria
- All Critical and High issues resolved
- Boot to login prompt in <10s on QEMU (4 cores)
- No panics under 72-hour stress test (4 cores, all driver types)
- AP startup race-free with 128 simulated CPUs
- NUMA topology correctly discovered from QEMU SRAT
- Service restart within 5 seconds of crash
- No priority inversion >100ms under load
- All patches in local/patches/kernel/, wired into recipe.toml
- Boot-tested on QEMU UEFI with scripts/run_mini.sh
Dependency Graph
P15-1 (CPU_COUNT race) ─────┐
P15-2 (AP_READY sync) ──────┤
P15-3 (TLB range race) ─────┤
P15-4 (MCS ordering) ───────┼──→ P16-1 (SIPI timing)
P15-5 (NUMA ordering) ──────┤    P16-2 (ESR check)
P15-10 (TLB ordering) ──────┘    P16-3 (MAX_CPU)
                                 P16-4 (graceful degradation)
P15-6 (init deadlock) ──────────→ P16-5 (firmware bugs)
P15-7 (service timeout)
P15-8 (cycle detection)
P15-9 (/tmp creation)
P16-* ──→ P17-1 (NUMA work stealing)
          P17-2 (transitive PI)
          P17-3 (CPU affinity)
          P17-4 (preemption)
          P17-5 (load balancing)
P17-* ──→ P18-1 (restart policy)
          P18-2 (crash cleanup)
          P18-3 (bounded queues)
          P18-4 (watchdog)
P18-* ──→ P19-* (stress tests)