

Red Bear OS SMP Boot & Scheduler Hardening Plan

Version: 1.0 (2026-05-16)
Status: Active
Canonical: This document supersedes SMP-SCHEDULER-IMPROVEMENT-PLAN.md for forward work.
Scope: Kernel SMP, AP startup, x2APIC, per-CPU data, TLB shootdowns, IRQ routing, scheduler, userspace boot, daemon robustness.

Assessment Summary

This plan is based on a comprehensive assessment of the kernel SMP infrastructure (20 source files), the userspace boot process (10 source files), and the modern Intel/AMD MP specifications. Findings were cross-referenced with Linux smpboot.c, Zircon lk_main, and seL4 multicore boot.

Total issues found: 38 kernel + 16 userspace = 54 issues

  • Critical: 6 kernel + 3 userspace = 9
  • High: 7 kernel + 4 userspace = 11
  • Medium: 10 kernel + 5 userspace = 15
  • Low: 15 kernel + 4 userspace = 19

Kernel SMP Issues

Critical (6)

| # | Issue | File | Root Cause |
|---|---|---|---|
| K1 | AP startup LogicalCpuId race | madt/arch/x86.rs:153,244,276,365 | Two APs CPU_COUNT.load(Relaxed) → same ID → both fetch_add(1) |
| K2 | AP_READY dual-mechanism sync race | madt/arch/x86.rs:174-225 | Trampoline u64 ap_ready.write(0) + static AtomicBool AP_READY: inconsistent ordering, UB on cast |
| K3 | TLB shootdown range race | percpu.rs:134-137 | Concurrent shootdowns overwrite tlb_flush_start/tlb_flush_count between flag set and IPI |
| K4 | MCS lock missing memory fences | sync/mcs.rs:74-101 | No Release after next.store(), no Acquire before locked.load() |
| K5 | Unbounded priority inversion chain | sync/mcs.rs:126-145 | PI donation goes one level only; transitive chains unbounded |
| K6 | Scheduler context switch flag not cleared on panic | switch.rs:164,298 | in_context_switch stays true → permanent CPU lockup |

High (7)

| # | Issue | File | Root Cause |
|---|---|---|---|
| K7 | Missing SIPI timing delays | madt/arch/x86.rs:192-337 | Spin-count delays, not TSC-based. Intel SDM requires 10ms INIT→SIPI |
| K8 | NUMA node set after CPU visible | madt/arch/x86.rs:244,253 | CPU_COUNT.fetch_add() before numa_node.set() |
| K9 | Empty memory fence before AP starts | madt/arch/x86.rs:188 | asm!("") is compiler barrier only, not hardware fence |
| K10 | TLB range Relaxed ordering | percpu.rs:146,179 | Range stores use Relaxed, no barrier before IPI send |
| K11 | IOAPIC affinity no CPU online check | ioapic.rs:126-137 | Accepts any ApicId without validation |
| K12 | MAX_CPU_COUNT=128 too small | cpu_set.rs:44 | AMD EPYC has 128C/256T, Threadripper PRO 96C/192T |
| K13 | Global IRQ count lock | scheme/irq.rs:67 | COUNTS.lock() is global spinlock on hot path |

Medium (10)

| # | Issue | File | Root Cause |
|---|---|---|---|
| K14 | x2APIC detection no fallback | local_apic.rs:56-66 | If x2APIC init fails, no fallback to xAPIC |
| K15 | AP startup timeout not time-based | madt/arch/x86.rs:44 | AP_SPIN_LIMIT=1_000_000 spin counts vary by clock speed |
| K16 | TLB shootdown no timeout | percpu.rs:134-143 | Spin waits indefinitely if target CPU crashed |
| K17 | Broadcast shootdown sequential flag-setting | percpu.rs:151-184 | O(n) flag set loop on 128+ core systems |
| K18 | PI donation write-once | sync/mcs.rs:62 | Later higher-priority waiter doesn't update |
| K19 | PI donation Relaxed ordering | sync/mcs.rs:142 | pi_donated_prio.store(Relaxed) may not be visible |
| K20 | Scheduler NUMA-unaware | switch.rs:357-495 | same_node() exists but never used in work stealing |
| K21 | IOAPIC legacy IRQs always BSP | ioapic.rs:392 | IRQs 0-15 hardcoded to BSP, no load balancing |
| K22 | RSDP no BIOS scan fallback | rsdp.rs:19-48 | Only uses bootloader-supplied address |
| K23 | No SDT checksum validation | acpi/mod.rs:94-180 | Only RSDP checksum verified, not child SDTs |

Low (15)

K24–K38: Trampoline writable+executable, fixed trampoline address 0x8000, no SIPI delivery status check, no PercpuBlock cleanup on AP failure, PercpuBlock registration race, no NUMA barrier, hardcoded preemption timer, no preemption guard enforcement, no MCS recursive detection, scheduler recursion limitation, MADT unknown types silently ignored, no MADT revision check, no SLIT diagonal validation, RSDP length bounds too loose, no APIC ESR clear before SIPI.


Userspace Boot Issues

Critical (3)

| # | Issue | File | Root Cause |
|---|---|---|---|
| U1 | Init dependency deadlock | redbear-mini.toml:244-256 | 00_intel-gpiod.service has default_dependencies=true → circular wait with driver-manager |
| U2 | No service timeout | service.rs:78-118 | Notify/Scheme types block forever if daemon hangs |
| U3 | Dependency cycle detection missing | scheduler.rs:77-95 | BFS load_units() loops forever on circular requires_weak |

High (4)

| # | Issue | File | Root Cause |
|---|---|---|---|
| U4 | No daemon restart policy | init system | Crashed daemons stay dead, no auto-restart |
| U5 | No crash cleanup | driver-manager | Spontaneous crash doesn't release scheme/PCI/IRQ |
| U6 | Boot timeline /tmp/ missing | driver-manager main.rs:24 | Writes to /tmp/... without ensuring /tmp exists |
| U7 | Hotplug redundant enumeration | hotplug.rs:31-40 | Full PCI/ACPI re-scan every 2s |

Medium (5)

U8–U12: Hotplug unbound device removal bug, ided I/O privilege expect(), serial boot markers blocking 800ms, limited parallelism (50/step), no queue overflow handling.

Low (4)

U13–U16: PCI enumeration no timeout, async enumeration no join timeout, boot status command broken if no timeline, no driver health endpoint.


Reference: Modern Hardware Requirements

Sources: Intel 64/IA-32 SDM Vol 3A Ch 8, AMD64 APM Vol 2 Ch 7, ACPI 6.5, Intel x2APIC spec, Linux smpboot.c, Zircon lk_main, seL4 multicore boot.

AP Startup Timing (Intel SDM)

  • INIT deassert → SIPI: 10ms (modern CPUs: can be shorter)
  • SIPI #1 → SIPI #2: 10-300µs (modern: 10µs, legacy: 300µs)
  • AP response timeout: 10 seconds (Linux)
  • ESR check: Clear before each SIPI, read after to verify acceptance

AP Startup Timing (AMD)

  • Similar INIT/SIPI sequence
  • CPUID leaf 0x8000001E for topology (ext_apic_id, core_id, node_id)
  • CPUID leaf 0x1F preferred for V2 extended topology (Intel + newer AMD)
  • APIC ID may exceed 255 → x2APIC mandatory

x2APIC Requirements

  • Mandatory: CPU count > 255 (8-bit APIC ID exhausted)
  • Detection: CPUID.01H:ECX[bit 21]
  • ICR: Single 64-bit MSR write (vs two 32-bit MMIO writes)
  • No delivery status bit: Hardware guarantees delivery
  • Self-IPI: Dedicated MSR 0x83F (fastest single-IPI path)

ACPI MADT Entry Types (ACPI 6.5)

  • Type 0: Processor Local APIC (legacy 8-bit)
  • Type 1: I/O APIC
  • Type 2: Interrupt Source Override
  • Type 4: Local APIC NMI
  • Type 5: Local APIC Address Override
  • Type 9: Processor Local x2APIC (32-bit ID, required for modern hardware)
  • Type 10: Local x2APIC NMI
  • Type 16 (0x10): Multiprocessor Wakeup Structure (ACPI 6.4+)
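
As a rough illustration of how the entry list might be walked, the sketch below classifies the APIC-related types from the list and reports anything else as unknown instead of ignoring it silently. Every MADT interrupt controller structure begins with a one-byte type followed by a one-byte total length; the names MadtEntryKind, classify, and walk_entries are hypothetical and are not taken from the Red Bear kernel.

```rust
/// APIC-related MADT entry kinds from the list above (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MadtEntryKind {
    LocalApic,             // type 0
    IoApic,                // type 1
    IntSrcOverride,        // type 2
    LocalApicNmi,          // type 4
    LocalApicAddrOverride, // type 5
    LocalX2Apic,           // type 9
    LocalX2ApicNmi,        // type 10
    Unknown(u8),           // anything else, kept so the caller can log it
}

/// Maps the raw type byte of an entry to a kind.
pub fn classify(entry_type: u8) -> MadtEntryKind {
    match entry_type {
        0 => MadtEntryKind::LocalApic,
        1 => MadtEntryKind::IoApic,
        2 => MadtEntryKind::IntSrcOverride,
        4 => MadtEntryKind::LocalApicNmi,
        5 => MadtEntryKind::LocalApicAddrOverride,
        9 => MadtEntryKind::LocalX2Apic,
        10 => MadtEntryKind::LocalX2ApicNmi,
        other => MadtEntryKind::Unknown(other),
    }
}

/// Walks the entry area of a MADT (the bytes after the fixed table header).
/// Returns the kinds in table order, or the byte offset of the first
/// malformed entry (length below 2 or running past the end of the table).
pub fn walk_entries(entries: &[u8]) -> Result<Vec<MadtEntryKind>, usize> {
    let mut kinds = Vec::new();
    let mut offset = 0;
    while offset < entries.len() {
        if offset + 2 > entries.len() {
            return Err(offset);
        }
        let len = entries[offset + 1] as usize;
        if len < 2 || offset + len > entries.len() {
            return Err(offset);
        }
        kinds.push(classify(entries[offset]));
        offset += len;
    }
    Ok(kinds)
}
```

Rejecting a length below 2 matters because a zero-length entry would otherwise keep the walk at the same offset forever.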

Common Firmware Bugs

  1. Duplicate APIC IDs in MADT
  2. Incorrect enabled flags
  3. Missing entries (CPU exists but no MADT entry)
  4. MADT UID / DSDT _UID mismatch
  5. SLIT diagonal != 10 (Linux validates and rejects)
  6. SRAT-SLIT inconsistency

Linux Best Practices

  • Parallel AP bringup (all APs kicked simultaneously) — reduces boot 500ms→100ms on 96-core
  • Adaptive SIPI timing: init_udelay=0 → 10µs for modern CPUs
  • 10-second timeout with schedule() yield loop
  • ESR check after each SIPI, retry up to 2×
  • cpu_callout_mask / cpu_callin_mask handshake

Zircon Best Practices

  • Phased initialization: BSP → topology → AP release → AP init → sync
  • 30-second startup timeout, OOPS (not panic) on timeout
  • Idle threads pre-allocated before releasing APs
  • Init levels coordinate initialization order

seL4 Best Practices

  • Single atomic write releases all APs simultaneously
  • Explicit cache maintenance for ARM32
  • Big kernel lock for simplicity (not scalable)
  • BOOT_BSS section for boot-time variables

Improvement Plan — Patch Series

Priority 0: Fix All Discovered Issues (P15)

P15-1: AP Startup LogicalCpuId Race Fix (Critical K1)

Files: src/acpi/madt/arch/x86.rs
Change: Replace CPU_COUNT.load(Relaxed) + LogicalCpuId::new(next_cpu) + CPU_COUNT.fetch_add(1) with single let cpu_id = LogicalCpuId::new(CPU_COUNT.fetch_add(1, SeqCst)). Remove separate load. Move all pre-startup setup (PercpuBlock init, NUMA node set) to between allocation and fetch_add.
Risk: Low. Standard atomic fix.
Verification: Boot with 4+ CPUs, verify all get unique IDs.
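
A minimal sketch of the single read-modify-write allocation, assuming a stand-in LogicalCpuId type and a caller-supplied counter (the kernel's real type and its CPU_COUNT static are not reproduced here):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical stand-in for the kernel's LogicalCpuId.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct LogicalCpuId(u32);

impl LogicalCpuId {
    pub const fn new(id: u32) -> Self {
        Self(id)
    }

    pub const fn get(self) -> u32 {
        self.0
    }
}

/// Allocates the next logical CPU ID with one atomic operation.
/// fetch_add returns the value before the increment, so two concurrent
/// callers can never be handed the same ID, unlike a separate load
/// followed by a later fetch_add.
pub fn allocate_cpu_id(cpu_count: &AtomicU32) -> LogicalCpuId {
    LogicalCpuId::new(cpu_count.fetch_add(1, Ordering::SeqCst))
}
```

With the counter starting at 1 (ID 0 taken by the BSP), eight concurrent callers receive the IDs 1 through 8 in some order, with no duplicates.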

P15-2: AP_READY Sync Consolidation (Critical K2)

Files: src/acpi/madt/arch/x86.rs
Change: Replace dual mechanism with single AtomicU8 at TRAMPOLINE+8. AP writes 1 when ready. BSP polls with SeqCst. Add fence(SeqCst) before/after writing trampoline args to ensure AP sees them.
Risk: Medium. Changes trampoline protocol.
Verification: Boot test on QEMU, verify all APs start correctly.

P15-3: TLB Shootdown Range Race Fix (Critical K3)

Files: src/percpu.rs
Change: Pack range into single AtomicU64 (bits [63:32] = start page, bits [31:0] = count). Single atomic swap sets flag + range atomically. Handler unpacks with single load.
Risk: Medium. Affects all TLB shootdowns.
Verification: Multi-core stress test with frequent mmap/munmap.
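
The packing could look roughly like the sketch below. It assumes that a word of 0 means "no request pending" (so a request must have a non-zero count), that the requester claims the slot with compare_exchange so a second requester cannot overwrite an unserviced range, and that the handler takes the request with swap. The names pack, unpack, post, and take are illustrative only.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Assumed encoding: 0 means no shootdown request is pending.
const IDLE: u64 = 0;

/// Packs the range: bits [63:32] = start page, bits [31:0] = page count.
pub fn pack(start_page: u32, count: u32) -> u64 {
    ((start_page as u64) << 32) | count as u64
}

/// Splits a packed word back into (start page, page count).
pub fn unpack(word: u64) -> (u32, u32) {
    ((word >> 32) as u32, word as u32)
}

/// Requester side: publishes flag and range in one atomic operation.
/// Returns false if an earlier request is still pending, so the caller
/// can retry instead of clobbering it.
pub fn post(slot: &AtomicU64, start_page: u32, count: u32) -> bool {
    assert!(count != 0, "a zero count would be indistinguishable from IDLE");
    slot.compare_exchange(
        IDLE,
        pack(start_page, count),
        Ordering::AcqRel,
        Ordering::Acquire,
    )
    .is_ok()
}

/// Handler side: takes the pending request, if any, and marks the slot idle.
pub fn take(slot: &AtomicU64) -> Option<(u32, u32)> {
    match slot.swap(IDLE, Ordering::AcqRel) {
        IDLE => None,
        word => Some(unpack(word)),
    }
}
```

Because the flag and the range live in the same word, there is no window between "flag set" and "range written" for another CPU to interleave with.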

P15-4: MCS Lock Memory Ordering (Critical K4)

Files: src/sync/mcs.rs
Change: Add fence(Release) after next.store(new_node, Relaxed) at line 55. Add fence(Acquire) before locked.load(Relaxed) at line 59. Change PI donation store to Release.
Risk: Low. Standard lock ordering fix.
Verification: Multi-threaded contention test.

P15-5: NUMA Node Before CPU Visible (High K8)

Files: src/acpi/madt/arch/x86.rs
Change: Move record_apic_mapping() and percpu.numa_node.set() BEFORE CPU_COUNT.fetch_add(). Add fence(SeqCst) between them so scheduler sees NUMA data.
Risk: Low. Reordering of operations.
Verification: Boot with QEMU SRAT, verify NUMA nodes set before scheduler sees CPUs.

P15-6: Init Dependency Deadlock Fix (Critical U1)

Files: config/redbear-mini.toml, config/redbear-full.toml
Change: Add default_dependencies = false to 00_intel-gpiod.service, 00_i2c-dw-acpi.service, 00_i2c-gpio-expanderd.service, 00_i2c-hidd.service, ucsid.service. Add explicit requires_weak for actual dependencies only.
Risk: Low. Config-only change.
Verification: Boot redbear-mini, verify all services start without deadlock.

P15-7: Service Timeout Mechanism (Critical U2)

Files: recipes/core/base/source/init/src/service.rs, recipes/core/base/source/init/src/scheduler.rs
Change: Add timeout_secs: Option<u32> to Notify and Scheme variants. Use set_read_timeout() on INIT_NOTIFY pipe. On timeout, log error and mark service failed. Boot continues.
Risk: Medium. Changes init behavior.
Verification: Create a service that never notifies, verify boot continues after timeout.
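
The control flow can be sketched as below. Note the substitution: a std::sync::mpsc channel stands in for the INIT_NOTIFY pipe, and recv_timeout stands in for a read with set_read_timeout(), because the init code itself is not shown here. ServiceState and wait_for_ready are hypothetical names.

```rust
use std::sync::mpsc::Receiver;
use std::time::Duration;

/// Hypothetical outcome of waiting for a service's readiness notification.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServiceState {
    Ready,
    Failed,
}

/// Waits for a readiness notification.
/// `None` keeps the old behaviour (block until notified or the daemon goes
/// away); `Some(secs)` bounds the wait so that boot can continue.
pub fn wait_for_ready(notify: &Receiver<()>, timeout_secs: Option<u32>) -> ServiceState {
    let received = match timeout_secs {
        // No timeout configured: recv() only fails if the sender is gone.
        None => notify.recv().is_ok(),
        // Timeout configured: both "timed out" and "sender gone" count as failure.
        Some(secs) => notify
            .recv_timeout(Duration::from_secs(u64::from(secs)))
            .is_ok(),
    };
    if received {
        ServiceState::Ready
    } else {
        // The caller would log the error here and carry on with the next unit.
        ServiceState::Failed
    }
}
```

The design choice this illustrates is that a hung daemon is reported as a failed service rather than stalling every unit queued behind it.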

P15-8: Dependency Cycle Detection (Critical U3)

Files: recipes/core/base/source/init/src/scheduler.rs
Change: Add BTreeSet<UnitId> visited tracking in load_units(). If a unit ID is already in the visited set, log cycle error and skip.
Risk: Low. Defensive programming.
Verification: Create circular dependency in test config, verify detection.
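
A minimal sketch of a breadth-first load that terminates on circular dependencies, assuming UnitId is a plain string and the dependency table is a map from unit to its requires_weak list (both are simplifications, not the init system's real types). A repeated ID can also come from a dependency shared by two units, so this sketch only skips it and does not try to classify it as a cycle.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Hypothetical unit identifier; the real init system has its own type.
pub type UnitId = String;

/// Breadth-first walk over the dependency table starting at `root`.
/// Each unit is enqueued at most once, so a circular chain cannot keep
/// the loop running forever.
pub fn load_units(root: &str, deps: &BTreeMap<UnitId, Vec<UnitId>>) -> Vec<UnitId> {
    let mut visited: BTreeSet<UnitId> = BTreeSet::new();
    let mut queue: VecDeque<UnitId> = VecDeque::new();
    let mut order = Vec::new();

    visited.insert(root.to_string());
    queue.push_back(root.to_string());

    while let Some(unit) = queue.pop_front() {
        for dep in deps.get(&unit).into_iter().flatten() {
            // insert() returns false when the ID was already seen.
            if visited.insert(dep.clone()) {
                queue.push_back(dep.clone());
            }
        }
        order.push(unit);
    }
    order
}
```

For a table where a requires b, b requires c, and c requires a, the walk visits each unit once and then stops.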

P15-9: Boot Timeline /tmp/ Creation (High U6)

Files: local/recipes/system/driver-manager/source/src/main.rs
Change: Add let _ = std::fs::create_dir_all("/tmp"); at top of main(), before reset_timeline_log().
Risk: Trivial.
Verification: Boot, verify timeline file created.

P15-10: TLB Range Ordering Fix (High K10)

Files: src/percpu.rs
Change: Change tlb_flush_start/tlb_flush_count stores from Relaxed to Release. Change handler loads from Relaxed to Acquire.
Risk: Low. Ordering fix.
Verification: Multi-core TLB stress test.
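
The Release/Acquire pairing can be sketched as follows, with a hypothetical FlushRequest struct standing in for the per-CPU fields and an AtomicBool standing in for the pending flag. This shows only the ordering; it does not address the concurrent-requester race covered by P15-3.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Hypothetical per-CPU flush request: a range plus a pending flag.
pub struct FlushRequest {
    start: AtomicUsize,
    count: AtomicUsize,
    pending: AtomicBool,
}

impl FlushRequest {
    pub const fn new() -> Self {
        Self {
            start: AtomicUsize::new(0),
            count: AtomicUsize::new(0),
            pending: AtomicBool::new(false),
        }
    }

    /// Requester: write the range, then set the flag with Release so the
    /// range stores cannot be observed after the flag.
    pub fn publish(&self, start: usize, count: usize) {
        self.start.store(start, Ordering::Release);
        self.count.store(count, Ordering::Release);
        self.pending.store(true, Ordering::Release);
    }

    /// Handler: the Acquire half of the swap pairs with the Release store
    /// of the flag, so the range loads below see the published values.
    pub fn consume(&self) -> Option<(usize, usize)> {
        if !self.pending.swap(false, Ordering::AcqRel) {
            return None;
        }
        Some((
            self.start.load(Ordering::Acquire),
            self.count.load(Ordering::Acquire),
        ))
    }
}
```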


Priority 1: Stabilize SMP Boot (P16)

P16-1: Calibrated SIPI Delays (High K7)

Files: src/acpi/madt/arch/x86.rs
Change: Implement udelay(us) using TSC (calibrated during early boot). Replace spin-count delays: 10ms INIT→SIPI, 10µs SIPI→SIPI for modern CPUs.
Reference: Linux wakeup_secondary_cpu_via_init(), Intel SDM Vol 3A §8.4.

P16-2: AP Startup Error Status Check (Medium)

Files: src/acpi/madt/arch/x86.rs
Change: Clear the APIC ESR before each SIPI and read it afterwards. If it reports a delivery error, retry. Log failure for each CPU. Continue boot with available CPUs.
Reference: Linux checks send_status + accept_status.

P16-3: MAX_CPU_COUNT Increase (High K12)

Files: src/cpu_set.rs
Change: Increase MAX_CPU_COUNT from 128 to 256. Add boot-time warning if CPUs approach limit.

P16-4: AP Startup Graceful Degradation

Files: src/acpi/madt/arch/x86.rs
Change: If AP fails trampoline or AP_READY timeout, log warning, skip CPU, continue boot. Track cpu_online mask separately from cpu_possible.

P16-5: Firmware Bug Detection

Files: src/acpi/madt/mod.rs, src/acpi/mod.rs
Change: Add duplicate APIC ID detection during MADT parsing. Add SDT checksum validation (sum all bytes == 0). Log warnings for unknown MADT entry types. Cross-reference MADT entries with SRAT for consistency.
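
Two of these checks are small enough to sketch directly. The function names are hypothetical, and the inputs are simplified to a byte slice holding the whole table and a list of APIC IDs already extracted from the MADT.

```rust
use std::collections::BTreeSet;

/// An ACPI table is valid when all of its bytes, including the checksum
/// field, sum to zero modulo 256.
pub fn sdt_checksum_ok(table: &[u8]) -> bool {
    table.iter().fold(0u8, |acc, byte| acc.wrapping_add(*byte)) == 0
}

/// Returns every APIC ID that appears more than once, in ascending order.
pub fn duplicate_apic_ids(ids: &[u32]) -> Vec<u32> {
    let mut seen = BTreeSet::new();
    let mut duplicates = BTreeSet::new();
    for &id in ids {
        // insert() returns false if the ID was already present.
        if !seen.insert(id) {
            duplicates.insert(id);
        }
    }
    duplicates.into_iter().collect()
}
```

For example, the bytes 0x10, 0x20, 0xD0 sum to 0x100 and therefore pass the checksum, while the ID list 0, 1, 2, 1, 4, 0 reports 0 and 1 as duplicates.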


Priority 2: Desktop-Safe Scheduler (P17)

P17-1: NUMA-Aware Work Stealing (Medium K20)

Files: src/context/switch.rs
Change: In select_next_context(), prefer contexts on same NUMA node. Use numa::topology().same_node(). Apply penalty for cross-node stealing. Use SLIT distance matrix for weight.
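
One possible shape for the distance weighting is sketched below. The scoring formula (queue length scaled by 100 and divided by SLIT distance) is an assumption made for illustration, as are the StealCandidate struct and the function names; the plan does not fix a formula.

```rust
/// Hypothetical description of a CPU that work could be stolen from.
pub struct StealCandidate {
    pub cpu: usize,
    pub node: usize,
    pub runnable: u32,
}

/// Assumed weighting: longer queues score higher, larger SLIT distance
/// scores lower. SLIT uses 10 for "local", so a same-node queue keeps its
/// full weight and a node at distance 20 counts half.
pub fn steal_score(distance: u8, runnable: u32) -> u32 {
    runnable.saturating_mul(100) / u32::from(distance.max(10))
}

/// Picks the CPU to steal from, or None if no candidate has runnable work.
/// `slit[a][b]` is the SLIT distance from node a to node b.
pub fn pick_victim(
    my_node: usize,
    slit: &[Vec<u8>],
    candidates: &[StealCandidate],
) -> Option<usize> {
    candidates
        .iter()
        .filter(|c| c.runnable > 0)
        .max_by_key(|c| steal_score(slit[my_node][c.node], c.runnable))
        .map(|c| c.cpu)
}
```

With this weighting a local queue of 2 beats a remote queue of 3 at distance 20, but a remote queue of 8 still wins, so cross-node stealing is penalized and not forbidden.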

P17-2: Transitive Priority Inheritance (Critical K5)

Files: src/sync/mcs.rs
Change: When donating priority to lock holder, check if holder is waiting on another MCS lock. Propagate donation transitively up to 4 levels deep (bounded). Add lock graph cycle detection.
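
The bounded walk might be modelled as below. This is a simplified sketch, not the MCS lock code: tasks live in a slice, each task records the index of the holder it is blocked behind, and a larger number is assumed to mean higher priority. Task, blocked_on_holder, and donate_priority are hypothetical names.

```rust
/// Maximum number of holders a single donation may reach (from the plan).
pub const MAX_PI_DEPTH: usize = 4;

/// Hypothetical task model: a priority (larger = more urgent, an assumption)
/// and the index of the task holding the lock this task is waiting for.
#[derive(Debug, Clone)]
pub struct Task {
    pub prio: u8,
    pub blocked_on_holder: Option<usize>,
}

/// Donates the waiter's priority along the chain of lock holders.
/// Stops after MAX_PI_DEPTH levels, at the end of the chain, or when the
/// chain loops back to the waiter. Returns the number of holders visited.
pub fn donate_priority(tasks: &mut [Task], waiter: usize) -> usize {
    let donated = tasks[waiter].prio;
    let mut current = waiter;
    let mut levels = 0;
    while levels < MAX_PI_DEPTH {
        let Some(holder) = tasks[current].blocked_on_holder else {
            break; // the holder is runnable, so the chain ends here
        };
        if holder == waiter {
            break; // lock graph cycle back to the donor
        }
        if tasks[holder].prio < donated {
            tasks[holder].prio = donated;
        }
        current = holder;
        levels += 1;
    }
    levels
}
```

The depth bound also guarantees termination for a cycle that does not pass through the donor, since the loop never runs more than MAX_PI_DEPTH times.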

P17-3: CPU Affinity (New Feature)

Files: src/context/context.rs, src/context/switch.rs
Change: Add affinity: LogicalCpuSet to Context. Scheduler respects mask. Default: all CPUs. Add sched_setaffinity syscall.

P17-4: Preemption Latency Bounds

Files: src/context/switch.rs
Change: Replace hardcoded new_ticks >= 3 with configurable interval. Enforce preempt_locks > 0 guard at context switch. Add preemption-safe lock wrappers.

P17-5: Load Balancing

Files: src/context/switch.rs
Change: Periodic (every 100ms) load rebalancing. Migrate tasks from CPUs with >2 runnable to idle CPUs. Use NUMA distance for cross-node decisions.


Priority 3: Harden IPC & Scheme Servers (P18)

P18-1: Daemon Restart Policy (High U4)

Files: recipes/core/base/source/init/src/service.rs, recipes/core/base/source/init/src/scheduler.rs
Change: Add restart = "on-failure" | "always" | "never" to service config. Implement exponential backoff: 1s → 2s → 4s → 8s → 30s max. Track restart count, give up after 5 consecutive failures.
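
A sketch of the policy and backoff logic, under two assumptions that the plan does not spell out: the delay doubles per restart and is capped at 30 seconds (the listed sequence jumps from 8s straight to 30s, so the exact schedule may differ), and "give up" means no restart once five consecutive failures have been recorded. All names are hypothetical.

```rust
use std::time::Duration;

/// Values of the proposed `restart` service option.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RestartPolicy {
    OnFailure,
    Always,
    Never,
}

/// Consecutive failures after which init stops restarting a service.
pub const MAX_CONSECUTIVE_FAILURES: u32 = 5;
/// Upper bound for the backoff delay, in seconds.
const MAX_BACKOFF_SECS: u64 = 30;

/// Parses the config string; unknown values are rejected.
pub fn parse_policy(value: &str) -> Option<RestartPolicy> {
    match value {
        "on-failure" => Some(RestartPolicy::OnFailure),
        "always" => Some(RestartPolicy::Always),
        "never" => Some(RestartPolicy::Never),
        _ => None,
    }
}

/// Delay before the next restart: 1s, 2s, 4s, 8s, ... capped at 30s.
/// `restarts_so_far` is 0 for the first restart.
pub fn backoff_delay(restarts_so_far: u32) -> Duration {
    let secs = 1u64.checked_shl(restarts_so_far).unwrap_or(u64::MAX);
    Duration::from_secs(secs.min(MAX_BACKOFF_SECS))
}

/// Decides whether a service that just exited should be started again.
pub fn should_restart(policy: RestartPolicy, exited_ok: bool, consecutive_failures: u32) -> bool {
    if consecutive_failures >= MAX_CONSECUTIVE_FAILURES {
        return false;
    }
    match policy {
        RestartPolicy::Never => false,
        RestartPolicy::Always => true,
        RestartPolicy::OnFailure => !exited_ok,
    }
}
```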

P18-2: Process Monitoring & Cleanup (High U5)

Files: local/recipes/system/driver-manager/source/src/config.rs
Change: Non-blocking waitpid(WNOHANG) poll in hotplug loop. On driver exit: release scheme, unbind PCI device, free IRQ. Notify init of failure.

P18-3: Bounded Scheme Request Queues (Medium)

Change: Add configurable queue depth limit to scheme daemons. When full, return EBUSY. Prevents memory exhaustion.
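
A minimal sketch of a depth-limited queue, assuming a generic request type and an EBUSY value of 16 (the conventional errno number; the constant a real scheme daemon would use comes from its syscall crate). RequestQueue is a hypothetical name.

```rust
use std::collections::VecDeque;

/// Assumed errno value for "resource busy".
pub const EBUSY: i32 = 16;

/// Request queue with a fixed depth limit.
pub struct RequestQueue<T> {
    items: VecDeque<T>,
    depth_limit: usize,
}

impl<T> RequestQueue<T> {
    pub fn new(depth_limit: usize) -> Self {
        Self {
            items: VecDeque::with_capacity(depth_limit),
            depth_limit,
        }
    }

    /// Enqueues a request, or returns EBUSY when the queue is full so the
    /// client can back off instead of growing the daemon's memory use.
    pub fn push(&mut self, request: T) -> Result<(), i32> {
        if self.items.len() >= self.depth_limit {
            return Err(EBUSY);
        }
        self.items.push_back(request);
        Ok(())
    }

    /// Dequeues the oldest request, if any.
    pub fn pop(&mut self) -> Option<T> {
        self.items.pop_front()
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }
}
```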

P18-4: Watchdog/Health Monitoring (High)

Change: Optional health-check ping in scheme protocol. Init checks critical services every 5s. On failure, restart per restart policy.


Priority 4: Stress-Test Userspace Drivers (P19)

P19-1: Multi-Core Driver Stress Test

Change: Parallel I/O to ided/ahcid/nvmed + network e1000d + input evdevd. Verify no panics, no hangs, no data corruption over 1 hour.

P19-2: GPU Parallel Submission

Change: Multiple processes submit to virtio-gpu / redox-drm simultaneously. Verify fencing correctness, no GPU hang.

P19-3: USB Hotplug Under Load

Change: Rapid device connect/disconnect while transferring data via usbscsid. Verify cleanup and no resource leaks.

P19-4: Hotplug Stress Test

Change: QEMU virtio device hot-add/hot-remove while system under load. Verify driver-manager handles changes correctly.


Estimated Effort

| Priority | Patches | Lines | Time |
|---|---|---|---|
| P0 (Fix discovered) | P15-1 through P15-10 | ~800 | 2-3 days |
| P1 (SMP stabilize) | P16-1 through P16-5 | ~500 | 2-3 days |
| P2 (Scheduler) | P17-1 through P17-5 | ~1200 | 5-7 days |
| P3 (IPC harden) | P18-1 through P18-4 | ~800 | 3-5 days |
| P4 (Stress test) | P19-1 through P19-4 | ~600 | 2-3 days |
| Total | 28 patches | ~3900 | 14-21 days |

Acceptance Criteria

  • All Critical and High issues resolved
  • Boot to login prompt in <10s on QEMU (4 cores)
  • No panics under 72-hour stress test (4 cores, all driver types)
  • AP startup race-free with 128 simulated CPUs
  • NUMA topology correctly discovered from QEMU SRAT
  • Service restart within 5 seconds of crash
  • No priority inversion >100ms under load
  • All patches in local/patches/kernel/, wired into recipe.toml
  • Boot-tested on QEMU UEFI with scripts/run_mini.sh

Dependency Graph

P15-1 (CPU_COUNT race) ─────┐
P15-2 (AP_READY sync) ──────┤
P15-3 (TLB range race) ─────┤
P15-4 (MCS ordering) ───────┼──→ P16-1 (SIPI timing)
P15-5 (NUMA ordering) ──────┤    P16-2 (ESR check)
P15-10 (TLB ordering) ──────┘    P16-3 (MAX_CPU)
                                   P16-4 (graceful degradation)
P15-6 (init deadlock) ──────────→ P16-5 (firmware bugs)
P15-7 (service timeout)
P15-8 (cycle detection)
P15-9 (/tmp creation)

P16-* ──→ P17-1 (NUMA work stealing)
          P17-2 (transitive PI)
          P17-3 (CPU affinity)
          P17-4 (preemption)
          P17-5 (load balancing)

P17-* ──→ P18-1 (restart policy)
          P18-2 (crash cleanup)
          P18-3 (bounded queues)
          P18-4 (watchdog)

P18-* ──→ P19-* (stress tests)