RedBear-OS/local/docs/COMPREHENSIVE-SYSTEM-ASSESSMENT-AND-IMPROVEMENT-PLAN.md
vasilito ae46dabeb0 docs: Add comprehensive system assessment and improvement plan
Replace 5 stale planning docs with unified assessment:
- New: COMPREHENSIVE-SYSTEM-ASSESSMENT-AND-IMPROVEMENT-PLAN.md
  (12-subsystem audit vs Linux 7.1, 6 phases of work)
- Removed: IMPLEMENTATION-MASTER-PLAN, SUBSYSTEM-ASSESSMENT-2026-05,
  SMP-BOOT-HARDENING-PLAN, CPU-DMA-IRQ-MSI-SCHEDULER-FIX-PLAN,
  COMPREHENSIVE-BOOT-IMPROVEMENT-PLAN
2026-05-20 13:47:25 +03:00


Red Bear OS — Comprehensive System Assessment & Improvement Plan

Version: 1.0 (2026-05-20)
Reference: Linux kernel 7.1 (local/reference/linux-7.1/)
Supersedes: IMPLEMENTATION-MASTER-PLAN.md, SUBSYSTEM-ASSESSMENT-2026-05.md, SMP-BOOT-HARDENING-PLAN.md, CPU-DMA-IRQ-MSI-SCHEDULER-FIX-PLAN.md, COMPREHENSIVE-BOOT-IMPROVEMENT-PLAN.md

Canonical adjacent plans (remain authoritative for subsystem detail):

  • ACPI-IMPROVEMENT-PLAN.md — ACPI waves W0–W7
  • IRQ-AND-LOWLEVEL-CONTROLLERS-ENHANCEMENT-PLAN.md — PCI/IRQ/MSI-X
  • USB-IMPLEMENTATION-PLAN.md — USB phases U0–U6
  • CONSOLE-TO-KDE-DESKTOP-PLAN.md — desktop path
  • DRM-MODERNIZATION-EXECUTION-PLAN.md — GPU stack

1. Executive Summary

Red Bear OS is architecturally sound but has significant gaps in hardware-facing subsystems. The system boots to a login prompt in QEMU with a working console, networking, and basic device enumeration. However, the boot log and codebase audit show that bare-metal usability is limited: the system runs hot (no C-states, no thermal backend), may not bring up all CPU cores (AP startup races), may lose the USB keyboard (xHCI is the only host-controller driver), and gives operators minimal observability.

This document is a truthful, evidence-based assessment of every low-level subsystem, grounded in source code inspection, boot log analysis, and comparison against Linux 7.1 reference source. It replaces five stale/duplicate planning documents with one canonical assessment and forward plan.

Bottom-line verdicts

| Subsystem | Verdict |
|---|---|
| SMP | Real in kernel, but AP startup races and no bare-metal validation |
| CPU power (C-states) | Completely missing — root cause of heat on bare metal |
| CPU power (P-states) | Partial — cpufreqd exists but fragile |
| Thermal / sensors | Daemon exists but no backend — runs with empty surface |
| ACPI boot | Boot-baseline complete, not release-grade |
| ACPI thermal/fan | Missing — not implemented in acpid |
| USB xHCI | Real, QEMU-validated only |
| USB EHCI/UHCI/OHCI | No drivers exist — bare-metal USB keyboard unreliable |
| PCI / IRQ / MSI-X | Architecturally strong, low adoption in drivers |
| IOMMU AMD-Vi | Real, QEMU first-use proof only |
| IOMMU Intel VT-d | Missing — orphaned DMAR parsing only |
| Firmware loading | Real, on-demand, async |
| Memory management | Basic frame allocator — no swap/NUMA/hotplug |
| Logging | Append-only /var/log/system.log — no rotation/structured storage |
| Udev | Real but limited — polling hotplug, hardcoded rules |

2. Assessment by Subsystem

2.1 SMP / CPU Bring-up

Status: 🟡 Implemented, QEMU-proven, bare-metal unvalidated
Linux 7.1 equivalent: arch/x86/kernel/smpboot.c, arch/x86/kernel/apic/, kernel/smp.c

What is real

The kernel has a complete AP bring-up path:

  • AP trampoline with INIT/SIPI sequencing (madt/arch/x86.rs)
  • x2APIC/LocalApic branching with zero-extended ID fallback (local_apic.rs)
  • multi_core feature enabled by default (Cargo.toml)
  • Per-CPU data structures (percpu.rs)
  • IPI support for TLB shootdowns and scheduler wakeups
  • CPU set tracking (cpu_set.rs)

Source files inspected:

  • recipes/core/kernel/source/src/acpi/madt/arch/x86.rs
  • recipes/core/kernel/source/src/arch/x86_shared/device/local_apic.rs
  • recipes/core/kernel/source/src/startup/mod.rs
  • recipes/core/kernel/source/src/cpu_set.rs

Why you see "SMP: 1 CPUs online"

The boot log shows:

kernel::acpi::madt::arch:INFO -- SMP: 1 CPUs online (max 256)

This can happen for three reasons:

  1. QEMU i440fx exposes only 1 vCPU to the guest (most likely in this boot)
  2. AP startup timeout: AP_SPIN_LIMIT=1_000_000 is a spin count, not a time, so the effective timeout varies with clock speed; on slow or heavily loaded bare metal, APs may not signal readiness in time
  3. Firmware MADT only exposes 1 processor entry — rare but possible on broken firmware

On real bare metal with an AMD Ryzen or Intel Core system, if the firmware exposes multiple LocalApic entries and AP startup succeeds, the kernel will bring up all cores. But this has never been validated on the project's hardware matrix.

Critical weaknesses (38 kernel issues found)

SMP-BOOT-HARDENING-PLAN.md (2026-05-16) documented 54 issues across kernel and userspace boot. The most critical kernel-side items are:

| Issue | Severity | File | Description |
|---|---|---|---|
| AP startup LogicalCpuId race | Critical | madt/arch/x86.rs:153,244,276,365 | Two APs load CPU_COUNT simultaneously → same ID |
| AP_READY dual-mechanism race | Critical | madt/arch/x86.rs:174-225 | Trampoline u64 write + static AtomicBool — inconsistent ordering |
| TLB shootdown range race | Critical | percpu.rs:134-137 | Concurrent shootdowns overwrite range between flag set and IPI |
| MCS lock missing fences | Critical | sync/mcs.rs:74-101 | No Release/Acquire on MCS lock handoff |
| Unbounded priority inversion | Critical | sync/mcs.rs:126-145 | PI donation one level only |
| Scheduler panic flag leak | Critical | switch.rs:164,298 | in_context_switch stays true on panic → CPU lockup |
| Missing SIPI delays | High | madt/arch/x86.rs:192-337 | Spin-count delays, not TSC-based. Intel SDM requires 10ms INIT→SIPI |
| NUMA node set after CPU visible | High | madt/arch/x86.rs:244,253 | CPU_COUNT.fetch_add() before numa_node.set() |
| MAX_CPU_COUNT=128 too small | High | cpu_set.rs:44 | AMD EPYC has 128C/256T, Threadripper PRO 96C/192T |
| Global IRQ count lock | High | scheme/irq.rs:67 | COUNTS.lock() is global spinlock on hot path |

These are not theoretical. The LogicalCpuId race means two APs can claim the same CPU ID, leading to corrupted per-CPU data. The missing SIPI delays mean APs may fail to start on real hardware with strict firmware timing requirements.
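The LogicalCpuId race is a load-then-store pattern. A minimal sketch of the usual remedy, assuming the CPU counter can be a single atomic, is to let each AP claim its ID with one `fetch_add`. The function name `claim_logical_cpu_id` and its signature are hypothetical and do not come from the kernel source.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Claim a unique logical CPU ID for the calling AP (hypothetical helper).
///
/// `fetch_add` returns the previous value and increments in one atomic
/// step, so two APs can never observe the same ID. A separate `load`
/// followed by a later `store` would allow exactly that.
pub fn claim_logical_cpu_id(counter: &AtomicU32, max_cpus: u32) -> Option<u32> {
    let id = counter.fetch_add(1, Ordering::AcqRel);
    if id < max_cpus {
        Some(id)
    } else {
        // Over the limit: undo the increment so the count stays accurate.
        counter.fetch_sub(1, Ordering::AcqRel);
        None
    }
}
```

In this sketch the counter starts at 1 because the boot processor already owns ID 0; any per-CPU state (such as the NUMA node) would need to be published before the ID becomes visible to other CPUs.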

Gaps vs Linux 7.1

| Feature | Linux 7.1 | Red Bear |
|---|---|---|
| Robust AP bring-up | smpboot.c with TSC delays, online checks | Spin-count delays, race conditions |
| CPU hotplug | Full hot-add/hot-remove | Not implemented |
| CPU isolation | isolcpus, nohz_full | Not implemented |
| NUMA | Node-aware scheduling, memory policies | No NUMA awareness |
| Per-CPU idle threads | cpuhp/, idle thread per CPU | APs enter idle loop directly |
| x2APIC fallback | Clean fallback with explicit disable | Fallback works but warns |

Verdict: SMP infrastructure is real but has critical races that must be fixed before bare-metal multi-core can be trusted. No hardware validation exists.


2.2 CPU Power Management (P-states / C-states)

Status: 🟡 P-states partial, C-states missing entirely
Linux 7.1 equivalent: drivers/cpufreq/, drivers/cpuidle/, drivers/acpi/processor.c, arch/x86/kernel/acpi/cstate.c

P-states (frequency scaling)

cpufreqd is a real userspace daemon that:

  • Reads ACPI _PSS (Performance States) tables
  • Samples CPU load periodically
  • Writes IA32_PERF_CTL MSR to change P-state
  • Supports governors: Ondemand, Performance, Powersave
  • Exposes /scheme/cpufreq

Source: local/recipes/system/cpufreqd/source/src/main.rs

But it is fragile:

  1. write_msr() ignores its msr parameter and writes only the value to /dev/cpu/<n>/msr. This suggests it depends on a Linux-style MSR driver that uses file offset as the MSR index. No such driver was found in the Red Bear tree.
  2. The daemon reads MSR temperature via IA32_THERM_STATUS but has no actionable thermal policy — it can request "powersave" from cpufreqd itself, but there is no thermal trip point logic.
  3. The boot log shows cpufreqd: CPU0: 4 P-states (2400 - 1200 kHz) followed by cpufreqd: CPU0: MSR write failed (1/1), so the P-state change is failing.
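Item 1 describes a `write_msr()` that drops its `msr` argument. A minimal sketch of what a Linux-style MSR device expects, assuming the file offset selects the MSR index, is a positioned write of the 8-byte value at offset `msr`. The helper names and the `/dev/cpu/<n>/msr` path handling are illustrative; whether Red Bear provides such a device node is exactly the open question raised above.

```rust
use std::fs::OpenOptions;
use std::io;
use std::os::unix::fs::FileExt;

/// IA32_PERF_CTL, the MSR written to request a P-state.
pub const IA32_PERF_CTL: u64 = 0x199;

/// Write a 64-bit value to an MSR through a device that uses the
/// file offset as the MSR index (assumed Linux-style semantics).
pub fn write_msr_at<F: FileExt>(dev: &F, msr: u64, value: u64) -> io::Result<()> {
    // The offset carries the MSR number; the payload is the raw value.
    dev.write_all_at(&value.to_le_bytes(), msr)
}

/// Convenience wrapper; the device path layout is an assumption.
pub fn write_msr(cpu: u32, msr: u64, value: u64) -> io::Result<()> {
    let dev = OpenOptions::new()
        .write(true)
        .open(format!("/dev/cpu/{cpu}/msr"))?;
    write_msr_at(&dev, msr, value)
}
```

Keeping the positioned write separate from the path handling makes the offset logic testable against an ordinary file.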

C-states (idle power states)

This is completely missing and is the single largest contributor to system heat on bare metal.

What exists:

  • The kernel has a normal hlt instruction in the idle loop when no threads are runnable
  • No dedicated cpuidle subsystem
  • No ACPI _CST (C-state) table parsing
  • No mwait / monitor usage for deeper C-states
  • No C1E, C3, C6, C7 support

What Linux 7.1 has:

  • drivers/cpuidle/ with multiple drivers: acpi_idle, intel_idle, amd_idle
  • _CST table parsing in ACPI processor driver
  • mwait hint selection based on C-state depth
  • Latency and power measurements per C-state
  • Scheduler integration: cpuidle_enter() called from idle loop
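To illustrate what "latency and power measurements per C-state" are used for, here is a minimal sketch of idle-state selection: choose the deepest state whose exit latency and target residency both fit the predicted idle period. The struct, field names, and selection rule are simplified assumptions, not the Linux governor logic or any existing Red Bear code.

```rust
/// One idle state, as it might be described by a `_CST` entry (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct IdleState {
    pub name: &'static str,
    /// Worst-case time to return to C0, in microseconds.
    pub exit_latency_us: u32,
    /// Minimum idle time for the state to pay off, in microseconds.
    pub target_residency_us: u32,
}

/// Pick the deepest state that fits the constraints.
/// `states` must be ordered from shallowest to deepest.
pub fn select_idle_state(
    states: &[IdleState],
    predicted_idle_us: u32,
    latency_limit_us: u32,
) -> Option<&IdleState> {
    states.iter().rev().find(|s| {
        s.exit_latency_us <= latency_limit_us && s.target_residency_us <= predicted_idle_us
    })
}
```

When no state qualifies, the caller would fall back to the existing `hlt` path.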

Verdict: cpufreqd is real but MSR writes are failing. C-states are completely absent. On bare metal, CPUs run at full power even when idle. This is why the system is "very hot."


2.3 Thermal Management / Sensors / Hardware Monitoring

Status: 🔴 Thermal daemon exists but no backend; sensors missing; hwmon absent
Linux 7.1 equivalent: drivers/thermal/, drivers/hwmon/, drivers/acpi/thermal.c, drivers/acpi/fan.c

thermald

thermald is real code, not a stub. It:

  • Attempts to read ACPI thermal zones
  • Reads CPU MSR temperature (IA32_THERM_STATUS)
  • Can request powersave from cpufreqd
  • Can request ACPI sleep
  • Exposes /scheme/thermal

Source: local/recipes/system/thermald/source/src/main.rs

But it runs with an empty surface:

  • ACPI thermal zone enumeration is missing from acpid. The ACPI daemon's scheme surface (/scheme/acpi) has no thermal or fan nodes.
  • thermald expects /scheme/acpi/thermal and /scheme/acpi/fan to exist, but they do not.
  • fan.rs exists in the thermald source tree but is orphaned — it is not wired into main.rs (mod fan; is absent).

The boot log shows:

[  OK   ] Started Thermal management daemon
2026-05-20T09-13-44.583Z [@thermald:19 INFO] thermald: started

And then nothing. No thermal zones found, no temperature readings, no fan control.

Hardware sensors (hwmon)

There is no hwmon infrastructure in Red Bear OS.

What is missing:

  • No /sys/class/hwmon equivalent
  • No /scheme/hwmon
  • No sensor drivers

Linux 7.1 has 100+ hwmon drivers covering:

  • CPU temperature: coretemp (Intel), k10temp (AMD)
  • Motherboard sensors: nct6775, it87, f71882fg
  • Voltage regulators: ina2xx, ltc2947
  • Fan speed monitors: various Super-I/O chips

Red Bear has none of these.

SMBIOS / DMI

SMBIOS parsing exists in acpid/src/dmi.rs, but the boot log shows:

2026-05-20T09-12-40.920Z [@acpid::dmi:124 WARN] SMBIOS entry point not found in 0xF0000-0xFFFFF

This means DMI-based quirks and system identification are best-effort only. On systems without a valid SMBIOS entry point, the quirk system falls back to PCI/USB device ID matching only.
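The warning refers to the legacy scan of 0xF0000-0xFFFFF. As a sketch of what that scan involves, assuming the region has already been copied into a byte slice, the 32-bit SMBIOS entry point is found by looking for the `_SM_` anchor on 16-byte boundaries and verifying its checksum. The function name is hypothetical and this is not the code in acpid/src/dmi.rs. On UEFI systems the entry point is normally published through the EFI configuration table instead, which may be one reason the legacy scan comes up empty.

```rust
/// True when the bytes sum to zero modulo 256.
fn checksum_ok(bytes: &[u8]) -> bool {
    bytes.iter().fold(0u8, |acc, b| acc.wrapping_add(*b)) == 0
}

/// Scan a copy of the legacy BIOS window for a 32-bit SMBIOS entry point.
/// Returns the offset of the anchor within `region` (hypothetical helper).
pub fn find_smbios2_entry(region: &[u8]) -> Option<usize> {
    (0..region.len()).step_by(16).find(|&off| {
        let rest = &region[off..];
        if rest.len() < 6 || !rest.starts_with(b"_SM_") {
            return false;
        }
        // Byte 5 holds the entry point length; byte 4 is the checksum,
        // chosen so that all `len` bytes sum to zero.
        let len = rest[5] as usize;
        len >= 6 && len <= rest.len() && checksum_ok(&rest[..len])
    })
}
```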

Verdict: thermald is real but powerless. No hwmon, no sensor drivers, no ACPI thermal backend. The system has zero thermal awareness.


2.4 ACPI Stack

Status: 🟡 Boot-baseline complete, not release-grade
Linux 7.1 equivalent: drivers/acpi/, include/acpi/

What is strong

  • Kernel early ACPI discovery: RSDP, RSDT, XSDT
  • MADT parsing: LocalApic, IoApic, IntSrcOverride, NMI
  • x2APIC fallback with zero-extended IDs
  • FADT parsing, PM1a/PM1b register access
  • AML interpreter v6.1.1 with real mutex tracking
  • EC (Embedded Controller) byte-transaction access
  • _S5 shutdown derivation (though timing is fragile)
  • kstop kernel shutdown eventing consumed by redbear-sessiond
  • DMI exposure via /scheme/acpi/dmi

Source files:

  • recipes/core/kernel/source/src/acpi/
  • recipes/core/base/source/drivers/acpid/src/

What is weak

| Area | Status | Detail |
|---|---|---|
| acpid startup | Fragile | Active panic-grade expect() paths on firmware-origin data |
| _S5 timing | Fragile | Derived after PCI registration; pre-PCI shutdown reports "AML not ready" |
| DMAR | Orphaned | Parsing exists in acpid/src/dmar/mod.rs but not wired; Intel VT-d has no owner |
| Sleep beyond S5 | Missing | set_global_s_state() is S5-only; S3 suspend not validated |
| Thermal zones | Missing | No ACPI thermal zone enumeration |
| Fan devices | Missing | No ACPI fan device support |
| Battery/power | Provisional | power_snapshot() does real AML-backed probing but bootstrap preconditions are weak |
| AML fault handling | Partial | aml_physmem.rs has "log then fabricate 0" paths |
| SMBIOS | Best-effort | Entry point missing on many systems |
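The "panic-grade expect() on firmware-origin data" weakness can be illustrated with a small sketch: validate an ACPI table's length and checksum and return an error instead of panicking. The error type and function name are hypothetical; only the header layout (36-byte header, little-endian length at offset 4, bytes summing to zero) comes from the ACPI specification.

```rust
/// Reasons a firmware-provided table can be rejected (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum SdtError {
    TooShort,
    BadLength { declared: u32, available: usize },
    BadChecksum,
}

/// Size of the standard ACPI system description table header.
pub const SDT_HEADER_LEN: usize = 36;

/// Validate a table and return exactly the bytes it declares.
/// Malformed firmware data becomes an `Err`, never a panic.
pub fn validate_sdt(bytes: &[u8]) -> Result<&[u8], SdtError> {
    if bytes.len() < SDT_HEADER_LEN {
        return Err(SdtError::TooShort);
    }
    let declared = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    let len = declared as usize;
    if len < SDT_HEADER_LEN || len > bytes.len() {
        return Err(SdtError::BadLength { declared, available: bytes.len() });
    }
    let table = &bytes[..len];
    if table.iter().fold(0u8, |acc, b| acc.wrapping_add(*b)) != 0 {
        return Err(SdtError::BadChecksum);
    }
    Ok(table)
}
```

A caller can then log the error and skip the table, which keeps a single bad table from taking the daemon down.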

The ACPI improvement plan (ACPI-IMPROVEMENT-PLAN.md) tracks 8 waves of work (W0–W7). Current status:

  • W0 (Contracts): partially complete
  • W1 (Startup hardening): partially complete
  • W2 (AML ordering/shutdown): partially complete
  • W3 (Honest power surface): open
  • W4 (Physmem/EC/fault): partially complete
  • W5 (Ownership cleanup): open
  • W6 (Consumer integration): partially complete
  • W7 (Validation closure): open

Verdict: ACPI is the most mature low-level subsystem, but it is still boot-baseline complete, not release-grade. Thermal and fan support are completely absent.


2.5 PCI / IRQ / MSI-X

Status: 🟡 Architecturally strong, adoption-incomplete
Linux 7.1 equivalent: drivers/pci/, arch/x86/kernel/apic/, drivers/iommu/

What is real

  • pcid enumerates PCI devices via config space (I/O ports 0xCF8/0xCFC fallback when no ECAM/MCFG)
  • Capability parsing: MSI, MSI-X, power management, vendor-specific
  • driver-manager matches TOML configs by bus/class/vendor and spawns drivers
  • Kernel MSI message composition and validation (msi.rs, vector.rs)
  • MSI-X table mapping and vector allocation
  • redox-driver-sys provides IRQ handle abstractions, affinity helpers
  • IOAPIC routing with interrupt source overrides
  • Legacy PIC fallback
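For readers unfamiliar with the 0xCF8/0xCFC fallback mentioned in the first bullet, here is a minimal sketch of how the legacy configuration address is composed. The function name is hypothetical and this is not pcid's code; the bit layout is the standard PCI configuration mechanism #1.

```rust
/// I/O port that receives the configuration address.
pub const PCI_CONFIG_ADDRESS: u16 = 0xCF8;
/// I/O port through which the selected register is read or written.
pub const PCI_CONFIG_DATA: u16 = 0xCFC;

/// Compose the 32-bit value written to port 0xCF8 (hypothetical helper).
/// Returns `None` for out-of-range device or function numbers.
pub fn pci_config_address(bus: u8, device: u8, function: u8, offset: u8) -> Option<u32> {
    if device > 31 || function > 7 {
        return None;
    }
    Some(
        0x8000_0000                      // enable bit
            | ((bus as u32) << 16)
            | ((device as u32) << 11)
            | ((function as u32) << 8)
            | ((offset as u32) & 0xFC),  // registers are dword-aligned
    )
}
```

Because the offset field is only 8 bits wide, this mechanism reaches the first 256 bytes of configuration space; extended capabilities require ECAM.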

Source files:

  • recipes/core/base/source/drivers/pcid/
  • local/recipes/system/driver-manager/
  • recipes/core/kernel/source/src/arch/x86_shared/device/msi.rs
  • local/recipes/drivers/redox-driver-sys/source/src/irq.rs

What is weak

| Issue | Detail |
|---|---|
| Legacy IRQ dominance | e1000d and ided still use legacy IRQ (IRQ 11, IRQ 14/15) |
| MSI-X adoption | Only ixgbed and GPU paths use MSI-X; most drivers on legacy INTx |
| IOMMU MSI gate | iommu_validate_msi_irq() is a stub — always returns true |
| IRQ affinity | Available in API but not widely used |
| pcid helper fragility | Some paths still treat malformed capabilities as invariants |
| Hardware validation | MSI-X proven in QEMU only; no real hardware vector validation |

The IRQ/low-level plan (IRQ-AND-LOWLEVEL-CONTROLLERS-ENHANCEMENT-PLAN.md) correctly identifies that the architecture is sound but the runtime proof is thin. Priority 1 is "MSI-X runtime validation on real devices."
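As a rough illustration of what a non-stub MSI gate could check, the sketch below composes an x86 MSI message and validates that its address falls in the 0xFEEx_xxxx window and its vector lies in the usable range. The type and function names are hypothetical, the message uses fixed delivery and edge triggering only, and a real `iommu_validate_msi_irq()` would additionally consult interrupt-remapping state.

```rust
/// An x86 MSI address/data pair (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MsiMessage {
    pub address: u64,
    pub data: u32,
}

/// Compose a fixed-delivery, edge-triggered message for one destination.
pub fn compose_msi(dest_apic_id: u8, vector: u8) -> MsiMessage {
    MsiMessage {
        // Bits 31:20 are 0xFEE; bits 19:12 carry the destination APIC ID.
        address: 0xFEE0_0000 | ((dest_apic_id as u64) << 12),
        // Bits 7:0 carry the vector; delivery mode 000 means fixed.
        data: vector as u32,
    }
}

/// Basic sanity check on a message before it is programmed into a device.
pub fn validate_msi(msg: &MsiMessage) -> bool {
    let in_window = (msg.address & 0xFFFF_FFFF_FFF0_0000) == 0xFEE0_0000;
    let vector = (msg.data & 0xFF) as u8;
    // Vectors below 0x10 are reserved and 0xFF is not a valid MSI vector.
    in_window && (0x10..=0xFE).contains(&vector)
}
```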

Verdict: The PCI/IRQ substrate is one of the strongest parts of the stack, but it is not yet release-grade because MSI-X is not widely adopted and hardware validation is missing.


2.6 IOMMU / DMA

Status: 🟡 AMD-Vi real but unvalidated; Intel VT-d missing
Linux 7.1 equivalent: drivers/iommu/amd/, drivers/iommu/intel/, drivers/iommu/dma-iommu.c

AMD-Vi

The iommu daemon is real, not a stub:

  • AmdViUnit::init() maps MMIO, programs device tables, command buffer, event log, interrupt remap table (IRTE)
  • QEMU first-use proof passes: discovers units, initializes, drains events
  • Self-test path exists: redbear-phase-iommu-check

Source: local/recipes/system/iommu/source/src/amd_vi.rs

But:

  • The boot log shows: iommu: no AMD-Vi units found (source=none, kernel_acpi_status=empty, ivrs_path=none)
  • This happens because the IVRS table is absent on this platform (QEMU i440fx does not provide IVRS)
  • When zero units are found, the daemon registers scheme:iommu and exits
  • Real AMD hardware validation: NONE

Intel VT-d

  • DMAR parsing exists in acpid/src/dmar/mod.rs but is orphaned
  • No Intel VT-d runtime daemon
  • No DMA remapping for Intel platforms
  • iommu daemon is AMD-Vi only

DMA integration

  • DMA allocation exists in redox-driver-sys
  • But IOMMU integration is incomplete: iommu_validate_msi_irq() is a no-op, and there is no enforced DMA map/unmap with IOMMU translation
  • Linux 7.1 has dma-iommu.c which handles IOMMU-aware DMA mapping for all devices behind an IOMMU

Verdict: AMD-Vi is implemented but unvalidated. Intel VT-d is missing. DMA/IOMMU integration is incomplete.


2.7 USB Stack

Status: 🟡 xHCI real but QEMU-only; EHCI/UHCI/OHCI missing
Linux 7.1 equivalent: drivers/usb/host/, drivers/usb/core/, drivers/hid/usbhid/

xHCI

The xHCI driver (xhcid) is real and substantial:

  • ~6,000 lines of Rust
  • 88+ error handling fixes applied via Red Bear patch
  • Interrupt-driven path restored (MSI/MSI-X/INTx)
  • Event ring growth implemented (ring doubling)
  • BOS/SuperSpeed descriptor fetching
  • Speed detection for hub children
  • USB 3 hub endpoint configuration
  • Suspend/resume API skeleton

Source: recipes/core/base/source/drivers/usb/xhcid/

But:

  • Only QEMU-validated — no real hardware testing
  • ~57 TODO/FIXME comments remain
  • Some panic!() sites remain in device enumerator

Missing host controllers

No EHCI, UHCI, or OHCI drivers exist in the Red Bear tree.

| Controller | Speed | Why it matters |
|---|---|---|
| EHCI | USB 2.0 High Speed | Most USB 2.0 keyboards/mice |
| OHCI | USB 1.1 Full/Low Speed | AMD/VIA legacy USB |
| UHCI | USB 1.1 Full/Low Speed | Intel legacy USB |

Linux 7.1 has full implementations for all three:

  • drivers/usb/host/ehci-hcd.c (~4,500 lines)
  • drivers/usb/host/ohci-hcd.c (~3,500 lines)
  • drivers/usb/host/uhci-hcd.c (~2,800 lines)

The USB implementation plan honestly states:

"External USB keyboard input is reliably available only when the keyboard is reached through the xHCI -> usbhubd/usbhidd -> inputd path."

On many bare-metal systems, USB keyboards route through EHCI or OHCI, not xHCI. Red Bear cannot claim reliable USB keyboard boot fallback.
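One way to make this gap visible at boot is to classify each USB host controller from its PCI class code and report which ones have no driver. The sketch below assumes the standard class code 0x0C / subclass 0x03 programming-interface values; the enum and function names are hypothetical and not part of driver-manager.

```rust
/// USB host controller families distinguished by PCI programming interface.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UsbHostController {
    Uhci,
    Ohci,
    Ehci,
    Xhci,
}

/// Classify a PCI function by class, subclass, and programming interface.
pub fn classify_usb_controller(class: u8, subclass: u8, prog_if: u8) -> Option<UsbHostController> {
    // 0x0C = serial bus controller, 0x03 = USB.
    if class != 0x0C || subclass != 0x03 {
        return None;
    }
    match prog_if {
        0x00 => Some(UsbHostController::Uhci),
        0x10 => Some(UsbHostController::Ohci),
        0x20 => Some(UsbHostController::Ehci),
        0x30 => Some(UsbHostController::Xhci),
        _ => None,
    }
}

/// Reflects the state described in this assessment: only xHCI has a driver.
pub fn has_driver(controller: UsbHostController) -> bool {
    matches!(controller, UsbHostController::Xhci)
}
```

A boot-time warning for every controller where `has_driver` is false would tell an operator immediately why a keyboard on that controller is dead.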

Class drivers

| Driver | Status | Quality |
|---|---|---|
| usbhubd | Real | Good — interrupt-driven change detection, graceful per-port errors |
| usbhidd | Real | Good — HID report parsing, named producers, no panics in loop |
| usbscsid | Real | Good — BOT transport, stall recovery, ReadCapacity16 |

Verdict: xHCI is real but QEMU-only. The absence of EHCI/UHCI/OHCI is a critical bare-metal gap.


2.8 Firmware Loading

Status: 🟢 Real and functional
Linux 7.1 equivalent: drivers/base/firmware_loader/

The firmware-loader daemon is one of the most complete subsystems:

  • On-demand blob loading via scheme:firmware
  • Indexes /lib/firmware at startup
  • Persistent cache with fallback chains
  • Async request_firmware_nowait() with timeout and retry
  • Emits uevents for consumers
  • Read-only scheme with mmap support

Source: local/recipes/system/firmware-loader/source/

The boot log does not show firmware loading activity because no device requested firmware during this boot (no GPU, no Wi-Fi).

Verdict: This subsystem is production-ready architecturally. Needs hardware validation when GPU/Wi-Fi drivers are active.


2.9 Memory Management

Status: 🟡 Basic but functional; advanced features missing
Linux 7.1 equivalent: mm/, arch/x86/mm/

What is real

  • Frame allocator / buddy-like free list
  • Kernel page-table setup (4-level on x86_64)
  • Device-memory mapping for MMIO
  • Explicit memory-region handling
  • Early boot memory map parsing from ACPI/firmware
  • 7,092 MB detected in boot log

Source:

  • recipes/core/kernel/source/src/memory/mod.rs
  • recipes/core/kernel/source/src/startup/memory.rs
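As background for the 4-level page-table setup listed above, this small sketch shows how an x86_64 virtual address decomposes into four 9-bit table indices plus a 12-bit page offset. The function name is hypothetical and the code is illustrative, not taken from the kernel's memory module.

```rust
/// Split a virtual address into [PML4, PDPT, PD, PT] indices and the
/// byte offset within a 4 KiB page (hypothetical helper).
pub fn page_table_indices(vaddr: u64) -> ([usize; 4], usize) {
    // Each level indexes 512 entries, so each index is 9 bits wide.
    let index = |shift: u32| ((vaddr >> shift) & 0x1FF) as usize;
    (
        [index(39), index(30), index(21), index(12)],
        (vaddr & 0xFFF) as usize,
    )
}
```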

What is missing

| Feature | Linux 7.1 | Red Bear |
|---|---|---|
| Swap | Full swap with page reclaim | Not implemented |
| NUMA | Node-aware allocation, migrate pages | No NUMA awareness |
| Memory hotplug | Add/remove memory at runtime | Not implemented |
| Reclaim/compaction | kswapd, memory pressure handling | Not implemented |
| OOM killer | out_of_memory() kills processes | Not implemented |
| Huge pages | THP, hugetlbfs | Not implemented |
| Memory cgroups | memcg resource limits | Not implemented |
| Demand paging | Lazy allocation on fault | Basic but no swap backing |

Verdict: Sufficient for current boot and userspace needs, but not production-grade for memory-intensive workloads.


2.10 Logging Infrastructure

Status: 🟡 Basic append-only; no rotation, no structured storage
Linux 7.1 equivalent: No direct equivalent; compare to systemd-journald, rsyslog, syslog-ng

What is real

  • logd daemon serves scheme:log
  • Persists to /var/log/system.log
  • Prepends startup banner, backfills new sinks
  • Mirrors kernel log input
  • relibc syslog API (syslog(), openlog()) writes to /scheme/log

Source:

  • recipes/core/base/source/logd/src/main.rs
  • recipes/core/base/source/logd/src/scheme.rs

What is weak

| Issue | Detail |
|---|---|
| Append-only | /var/log/system.log grows forever |
| No rotation | No size-based or time-based truncation |
| No retention | Old logs never deleted |
| No structured format | Plain text only; no JSON or binary journal |
| read path TODO | scheme.rs has a TODO for reading log history |
| Console dominance | Most daemon output still goes to the console with console timestamps |
| No per-service logs | All logs in one file |

The boot log shows console timestamps because daemons write to stderr, which init captures and logs. The persistent /var/log/system.log exists but is append-only with no management.

Verdict: Functional for debugging but not suitable for production observability. Needs rotation, structured format, and per-service separation.


2.11 Udev / Device Discovery

Status: 🟡 Real but limited
Linux 7.1 equivalent: drivers/base/core.c, lib/kobject_uevent.c, udev/

What is real

udev-shim is a real implementation, not a placeholder:

  • Enumerates PCI devices via pcid scheme
  • Classifies devices by class/subclass/vendor
  • Creates /dev nodes and symlinks
  • Writes /etc/udev/rules.d/50-default.rules
  • Exposes scheme:udev
  • Polls for changes (not event-driven)

Source: local/recipes/system/udev-shim/source/

The boot log shows:

[  OK   ] Started udev compatibility shim
[INFO] udev-shim: enumerated 1 PCI device(s)
[INFO] udev-shim: wrote default rules to /etc/udev/rules.d/50-default.rules

What is weak

| Issue | Detail |
|---|---|
| Hardcoded rules | Only 3 rules: net naming (enp*), NVMe by-id, SATA by-id |
| Polling hotplug | Polls every N seconds; not event-driven like Linux udev/netlink |
| No rules engine | Cannot parse Linux udev rules; rules are compiled-in |
| libudev-stub TODO | local/recipes/libs/libudev-stub/recipe.toml explicitly marked TODO |
| Limited coverage | Only PCI devices; no USB, no ACPI, no platform devices |
| No persistent db | Device state not saved across reboots |

Linux 7.1 udev:

  • Event-driven via netlink NETLINK_KOBJECT_UEVENT
  • Full rules engine with MATCH, ACTION, ENV, RUN
  • Persistent database in /run/udev/
  • udevadm tool for querying and triggering
  • Integrates with systemd for device units

Verdict: Functional for basic PCI device naming but far from a full udev replacement. Polling hotplug is inefficient.


2.12 Input Stack

Status: 🟡 Real but uneven quality
Linux 7.1 equivalent: drivers/input/, drivers/hid/, drivers/serio/

What is real

| Component | Status | Detail |
|---|---|---|
| ps2d | Real | PS/2 keyboard + mouse; kernel serio byte queues |
| usbhidd | Real | HID report parsing, named producers |
| inputd | Real | Producer/consumer scheme, VT switching, keymaps |
| evdevd | Real | evdev scheme, orbclient→evdev translation |
| i2c-hidd | Real | ACPI PNP0C50 scan, _CRS parsing |
| intel-thc-hidd | Partial | PCI init works; main loop sleeps 5s — no input streaming |

The boot log shows PS/2 and evdev working:

[  OK   ] Started PS/2 driver
[  OK   ] Started Evdev input daemon
[INFO] evdevd: registered scheme:evdev

Gaps vs Linux 7.1

| Gap | Severity | Linux Reference |
|---|---|---|
| intel-thc-hidd no streaming | High | drivers/hid/intel-thc-hid/ full probe+report |
| No multitouch/ABS_MT | High | drivers/input/input-mt.c |
| No libinput acceleration | High | libinput: velocity curves, palm detection |
| No PS/2 extended protocols | Medium | libps2.c ImPS/2 scroll, Explorer 5-btn |
| No HID quirks table | Medium | hid-quirks.c 4000+ entries |
| No input hotplug | Medium | udev + inotify on /dev/input/ |

Verdict: The input stack exists and works for basic keyboard/mouse. Touch and advanced HID are incomplete.


3. Root Cause Analysis

Why the system runs hot on bare metal

  1. No C-state management → beyond the plain hlt in the idle loop, CPUs never enter deeper low-power idle states (C1E, C3, C6, C7), so idle cores keep drawing far more power than necessary.
  2. No ACPI thermal zones → acpid does not enumerate thermal zones, so thermald has no temperature data to act on.
  3. No hwmon sensor drivers → No temperature sensors are readable. The system is "flying blind."
  4. No ACPI fan control → Fan devices are not enumerated, so thermald cannot turn on cooling.
  5. cpufreqd MSR writes failing → Even P-state throttling is not working reliably (MSR write failed in boot log).

Fix priority: C-states (immediate heat reduction) > ACPI thermal zones (enables thermald) > hwmon sensors (operator visibility) > fan control (active cooling).

Why only 1 CPU shows online

  1. QEMU i440fx exposes only 1 vCPU by default (most likely in the provided boot log)
  2. AP startup races — LogicalCpuId race, missing SIPI delays, AP_READY dual mechanism can cause APs to fail startup on real hardware
  3. MAX_CPU_COUNT=128 too small for high-core-count AMD EPYC
  4. No bare-metal validation means we don't know which of these is the real blocker on actual hardware

Why USB keyboard may not work on bare metal

  1. Only xHCI exists — no EHCI/UHCI/OHCI drivers
  2. Many systems route USB 2.0 keyboards through EHCI
  3. Some AMD/VIA systems use OHCI for legacy ports
  4. Some Intel systems use UHCI for legacy ports
  5. No companion controller support to route low-speed devices from EHCI to xHCI

4. Honest Status Matrix

| Subsystem | Status | Linux 7.1 Parity | Evidence Class |
|---|---|---|---|
| SMP bring-up | 🟡 Partial | ~30% | Source + QEMU; bare metal unvalidated |
| C-states (cpuidle) | 🔴 Missing | 0% | No subsystem exists |
| P-states (cpufreq) | 🟡 Partial | ~20% | Daemon real but MSR writes failing |
| Thermal management | 🔴 Missing backend | ~10% | thermald exists but no ACPI backend |
| Hardware sensors (hwmon) | 🔴 Missing | 0% | No infrastructure, no drivers |
| ACPI boot / shutdown | 🟢 Baseline | ~40% | Boots, shutdown works, sleep partial |
| ACPI thermal / fan | 🔴 Missing | 0% | Not implemented in acpid |
| PCI enumeration | 🟢 Working | ~60% | Real, robust, driver-manager binds |
| MSI/MSI-X infrastructure | 🟡 Real | ~40% | Kernel real, driver adoption low |
| IOMMU AMD-Vi | 🟡 Real, unvalidated | ~30% | QEMU proof only |
| IOMMU Intel VT-d | 🔴 Missing | 0% | Orphaned DMAR parsing only |
| USB xHCI | 🟡 Real, QEMU-only | ~30% | No hardware validation |
| USB EHCI/UHCI/OHCI | 🔴 Missing | 0% | No drivers |
| Firmware loading | 🟢 Real | ~70% | On-demand, async, validated in build |
| Memory management | 🟡 Basic | ~30% | Frame allocator; no swap/NUMA/hotplug |
| Logging | 🟡 Basic | ~20% | Append-only, no rotation |
| Udev | 🟡 Limited | ~25% | Polling, hardcoded rules |
| Input (PS/2, USB HID) | 🟢 Working | ~50% | Real but touch/advanced HID missing |
| Input (I2C HID, THC) | 🟡 Partial | ~20% | i2c-hidd real; intel-thc-hidd non-functional |
| D-Bus system bus | 🟢 Working | ~60% | Real, services wired |
| D-Bus session bus | 🟡 Partial | ~30% | Partially wired |
| Network (wired) | 🟢 Working | ~60% | e1000d, virtio-net work |
| Network (Wi-Fi) | 🟡 Host-tested | ~20% | Intel stack builds; no hardware validation |
| Bluetooth | 🟡 Experimental | ~15% | BLE controller probe works; limited |

5. New Improvement Plan

This plan is ordered by impact on bare-metal usability and dependency chain. Earlier phases unblock later ones.

Phase 1: Bare-Metal Boot Hardening (6–8 weeks)

Goal: Boot reliably on diverse bare metal with all cores, reasonable temperature, and working USB keyboard.

1.1 Fix SMP AP Startup (2 weeks)

  • Fix K1 (LogicalCpuId race) — use fetch_add before AP reads ID
  • Fix K2 (AP_READY dual mechanism) — consolidate to single atomic
  • Fix K7 (missing SIPI delays) — add TSC-based 10ms INIT→SIPI delay per Intel SDM
  • Increase MAX_CPU_COUNT to 256
  • Validate on AMD Ryzen and Intel Core bare metal
  • Capture boot log showing SMP: N CPUs online where N > 1
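The TSC-based delay in the K7 item can be sketched as follows, assuming the TSC frequency is already known (for example from calibration) and that a counter-reading function is available. Both helper names are hypothetical; in the kernel the reader would likely wrap the `rdtsc` instruction, which is abstracted here as a closure so the logic can be exercised without hardware.

```rust
/// Convert a duration in microseconds to TSC cycles, rounding up.
pub fn tsc_cycles_for(duration_us: u64, tsc_hz: u64) -> u64 {
    // Use 128-bit arithmetic so large frequencies cannot overflow.
    ((duration_us as u128 * tsc_hz as u128 + 999_999) / 1_000_000) as u64
}

/// Busy-wait until at least `cycles` counter ticks have elapsed.
/// `read_tsc` is any monotonically increasing cycle counter.
pub fn delay_until<F: FnMut() -> u64>(mut read_tsc: F, cycles: u64) {
    let start = read_tsc();
    // wrapping_sub keeps the comparison correct across counter wraparound.
    while read_tsc().wrapping_sub(start) < cycles {
        std::hint::spin_loop();
    }
}
```

With a 3 GHz TSC, the 10 ms INIT→SIPI delay corresponds to `tsc_cycles_for(10_000, 3_000_000_000)` cycles, independent of how fast the spin loop itself executes.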

1.2 Implement Basic C-states (2 weeks)

  • Add cpuidle framework in kernel: idle state table, enter/exit hooks
  • Parse ACPI _CST table in acpid, expose via /scheme/acpi/cstates
  • Implement hlt-based idle (C1) — immediate heat reduction
  • Add mwait-based C1E/C3 for Intel; add AMD C1E support
  • Wire to scheduler idle path: call cpuidle_enter() when no runnable threads
  • Validate temperature drop on bare metal

1.3 Enable ACPI Thermal Zones (2 weeks)

  • Add thermal zone enumeration to acpid (_TZ namespace walk)
  • Expose /scheme/acpi/thermal with zone temperatures and trip points
  • Wire thermald to read from /scheme/acpi/thermal
  • Add passive cooling policy: throttle cpufreqd when trip point exceeded
  • Add ACPI fan device support (_FAN objects)
  • Wire thermald fan control

1.4 Add Basic Sensor Drivers (2 weeks)

  • Create scheme:hwmon or extend /scheme/acpi/thermal
  • Port coretemp driver (Intel CPU temperature MSR)
  • Port k10temp driver (AMD CPU temperature; Linux reads it from northbridge PCI/SMN registers, not an MSR)
  • Add temperature readout to redbear-info
  • Validate sensor readings on bare metal
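For the Intel side, the core of a coretemp-style readout is small. The sketch below decodes IA32_THERM_STATUS relative to TjMax, assuming the raw MSR values have already been read by some other means; the function names are hypothetical, while the register numbers and bit positions follow the Intel SDM layout that Linux coretemp also uses.

```rust
/// Per-core thermal status MSR.
pub const IA32_THERM_STATUS: u32 = 0x19C;
/// MSR that carries TjMax on most Intel Core processors.
pub const MSR_TEMPERATURE_TARGET: u32 = 0x1A2;

/// Extract TjMax in degrees Celsius from MSR_TEMPERATURE_TARGET (bits 23:16).
pub fn tjmax_celsius(temperature_target: u64) -> u32 {
    ((temperature_target >> 16) & 0xFF) as u32
}

/// Decode a core temperature from IA32_THERM_STATUS.
/// Returns `None` when the reading-valid bit (bit 31) is clear.
pub fn core_temp_celsius(therm_status: u64, tjmax_c: u32) -> Option<u32> {
    if (therm_status & (1u64 << 31)) == 0 {
        return None;
    }
    // Bits 22:16 hold the digital readout: degrees below TjMax.
    let below_tjmax = ((therm_status >> 16) & 0x7F) as u32;
    tjmax_c.checked_sub(below_tjmax)
}
```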

Phase 2: USB Completeness (4–6 weeks)

Goal: USB keyboard and storage work on all bare metal.

2.1 EHCI Host Controller (3 weeks)

  • Implement EHCI HCD based on Linux drivers/usb/host/ehci-hcd.c
  • Support USB 2.0 high-speed keyboards, mice, storage
  • Integrate with driver-manager config
  • Validate on Intel and AMD bare metal

2.2 OHCI/UHCI Fallback (2 weeks)

  • Implement OHCI for AMD/VIA systems
  • Implement UHCI for Intel legacy systems
  • Add companion controller topology support

2.3 USB Boot Resilience (1 week)

  • Ensure USB keyboard available before login prompt on all profiles
  • Add USB storage boot support
  • Hot-plug stress testing on real hardware

Phase 3: IRQ / IOMMU / MSI-X Hardening (4–6 weeks)

Goal: Production-grade interrupt and DMA safety.

3.1 MSI-X Adoption (2 weeks)

  • Migrate e1000d to MSI-X
  • Migrate ided to MSI-X (or document legacy-IRQ-only rationale)
  • Add MSI-X fallback logging to all PCI drivers
  • Validate on real hardware

3.2 IOMMU Hardware Validation (2 weeks)

  • AMD-Vi validation on real AMD hardware
  • Implement Intel VT-d daemon (migrate from orphaned acpid DMAR)
  • Replace iommu_validate_msi_irq() stub with real validation
  • DMA map/unmap with IOMMU translation

3.3 IRQ Quality (2 weeks)

  • IRQ affinity validation per driver
  • Interrupt coalescing for network/storage
  • Spurious IRQ accounting improvement

Phase 4: Observability & Logging (2–4 weeks)

Goal: Operator can diagnose system health.

4.1 Structured Logging (2 weeks)

  • Add JSON-structured log format option to logd
  • Per-service log files in /var/log/<service>/
  • Size-based log rotation (e.g., 10 MB per file)
  • Time-based log retention (e.g., 7 days)
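Size-based rotation can be sketched in a few lines, assuming a numbered-suffix scheme (`system.log.1` is the newest rotated file) and that logd reopens its output after a rotation. The function names and the suffix convention are assumptions for illustration, not logd's current behavior.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Build the path of the `index`-th rotated file, e.g. `system.log.2`.
fn rotated_name(base: &Path, index: u32) -> PathBuf {
    let mut name = base.as_os_str().to_owned();
    name.push(format!(".{index}"));
    PathBuf::from(name)
}

/// Rotate `base` once it has reached `max_bytes`, keeping at most `keep`
/// rotated files. Returns `Ok(true)` if a rotation happened.
/// A `keep` of zero disables rotation in this sketch.
pub fn rotate_if_needed(base: &Path, max_bytes: u64, keep: u32) -> io::Result<bool> {
    let size = match fs::metadata(base) {
        Ok(meta) => meta.len(),
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(false),
        Err(e) => return Err(e),
    };
    if size < max_bytes || keep == 0 {
        return Ok(false);
    }
    // Drop the oldest file, then shift every remaining file up by one.
    let oldest = rotated_name(base, keep);
    if oldest.exists() {
        fs::remove_file(&oldest)?;
    }
    for i in (1..keep).rev() {
        let from = rotated_name(base, i);
        if from.exists() {
            fs::rename(&from, rotated_name(base, i + 1))?;
        }
    }
    fs::rename(base, rotated_name(base, 1))?;
    Ok(true)
}
```

Time-based retention could reuse the same walk, deleting rotated files whose modification time is older than the retention window.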

4.2 Udev Rules Engine (2 weeks)

  • Replace hardcoded rules with subset of Linux udev rules parser
  • Event-driven hotplug via scheme notifications (replace polling)
  • Persistent device database across reboots

4.3 System Health Dashboard (1 week)

  • redbear-info thermal/CPU/fan display tab
  • Boot timeline persistence across switchroot
  • Real-time CPU/memory/network metrics

Phase 5: Hardware Validation Matrix (4–6 weeks)

Goal: Evidence-based support claims.

5.1 Define Validation Targets

Minimum 4 hardware classes:

  1. AMD desktop (Ryzen, discrete GPU)
  2. Intel desktop (Core, integrated GPU)
  3. AMD laptop (Ryzen mobile)
  4. Intel laptop (Core mobile)

5.2 Per-Target Checklist

For each target, validate and record:

  • Boots to login prompt
  • All CPU cores online (SMP: N CPUs online matches hardware)
  • USB keyboard works at boot
  • USB storage mounts
  • Network (wired) obtains DHCP lease
  • Temperature readable via redbear-info
  • Shutdown succeeds cleanly
  • Reboot succeeds cleanly

5.3 Negative-Result Capture

  • Document failures per target (e.g., "AMD X670E: AP startup timeout", "Intel Raptor Lake: SMBIOS missing")
  • Update this assessment with validation evidence

Phase 6: Desktop Stack Continuation (Parallel)

Goal: Continue the CONSOLE-TO-KDE path on top of hardened substrate.

This phase is orthogonal to the low-level work above. It depends on:

  • Qt6Quick/QML downstream proof (unblocks kirigami)
  • Real KWin build
  • GPU CS ioctl backend + Mesa HW cross-compile

See CONSOLE-TO-KDE-DESKTOP-PLAN.md for detailed desktop path planning.


6. Stale Documents — Remove

The following documents are superseded by this assessment and should be removed from local/docs/:

| File | Reason |
|---|---|
| IMPLEMENTATION-MASTER-PLAN.md | Master plan role now covered by CONSOLE-TO-KDE v4.1 and this doc |
| SUBSYSTEM-ASSESSMENT-2026-05.md | Assessment consolidated here with broader scope |
| SMP-BOOT-HARDENING-PLAN.md | SMP issues and fixes incorporated here; detailed issue list can be referenced from git history |
| CPU-DMA-IRQ-MSI-SCHEDULER-FIX-PLAN.md | MSI Phase 1 is complete; remaining DMA/scheduler work tracked here |
| COMPREHENSIVE-BOOT-IMPROVEMENT-PLAN.md | Boot issues consolidated into this assessment |

Canonical documents that remain authoritative:

  • ACPI-IMPROVEMENT-PLAN.md — detailed ACPI wave execution
  • IRQ-AND-LOWLEVEL-CONTROLLERS-ENHANCEMENT-PLAN.md — PCI/IRQ/MSI-X details
  • USB-IMPLEMENTATION-PLAN.md — USB phase execution
  • CONSOLE-TO-KDE-DESKTOP-PLAN.md — desktop path
  • DRM-MODERNIZATION-EXECUTION-PLAN.md — GPU stack
  • WIFI-IMPLEMENTATION-PLAN.md — Wi-Fi architecture
  • BLUETOOTH-IMPLEMENTATION-PLAN.md — Bluetooth stack
  • DBUS-INTEGRATION-PLAN.md — D-Bus architecture
  • GREETER-LOGIN-IMPLEMENTATION-PLAN.md — greeter design
  • QUIRKS-SYSTEM.md — quirk infrastructure
  • PATCH-GOVERNANCE.md — patch workflow
  • BUILD-SYSTEM-HARDENING-PLAN.md — build system

7. Evidence Model

This assessment uses the same evidence vocabulary as the canonical subsystem plans:

| Class | Meaning |
|---|---|
| Source-visible | Behavior visible in checked-in source |
| Build-visible | Code compiles and stages in current build |
| QEMU-validated | Behavior exercised successfully in QEMU |
| Runtime-validated | Behavior exercised in real boot/runtime |
| Hardware-validated | Behavior proven on named bare-metal hardware |
| Negative-result-documented | Failures and gaps are explicitly recorded |

No subsystem in this assessment is marked "hardware-validated" because no component has been proven on real bare metal with the rigor defined in ACPI-IMPROVEMENT-PLAN.md Wave 7.


8. Definition of Done

This plan is complete when:

  1. SMP brings up all cores reliably on AMD and Intel bare metal
  2. C-states reduce idle power consumption measurably
  3. ACPI thermal zones are readable and thermald responds to trip points
  4. At least 2 sensor drivers report temperature on bare metal
  5. EHCI driver enables USB keyboard on systems without xHCI routing
  6. MSI-X is adopted by all new PCI drivers; legacy IRQ is documented fallback
  7. IOMMU validates on at least one AMD and one Intel platform
  8. Logging has rotation and per-service separation
  9. Udev-shim supports event-driven hotplug
  10. A validation matrix with 4+ hardware targets is published and maintained

End of assessment.