Update IRQ plan doc with all waves complete, add lspci runtime-mode observability (Wave 5)

This commit is contained in:
2026-04-24 14:45:40 +01:00
parent 41abe8a18c
commit 55af4d097b
2 changed files with 142 additions and 22 deletions
@@ -591,24 +591,27 @@ helper hardening comes before broad driver cleanup, and runtime-proof/observabil
### Wave 1 — Harden `pcid` / `pcid-spawner` orchestration
**Status**: ✅ Complete (Wave 1 hardening applied)
**Primary targets**
- `recipes/core/base/source/drivers/pcid/src/{main,scheme,driver_handler}.rs`
- `recipes/core/base/source/drivers/pcid-spawner/src/main.rs`
**Implement**
**Implemented**
- replace panic-grade launch/setup assumptions with explicit failure states
- make device-discovered / config-matched / driver-launch-attempted / driver-launch-failed /
device-enabled states observable
- stop leaving device state ambiguous when spawn fails after partial setup
- emit explicit logs for chosen driver, config source, interrupt mode, and launch result
- replaced all `expect()`/`unwrap()` in pcid daemon startup with `process::exit(1)` + log messages
- ACPI registration failures are now non-fatal: logged and continued without ACPI
- PCI header parse failures (`from_header`) now log and skip the device instead of panicking
- I/O BAR port overflow now logs and skips the BAR instead of panicking
- MSI message data overflow returns `InvalidBitPattern` error response instead of panicking
- pcid-spawner now logs interrupt mode, device address, and driver command for each spawned device
**Acceptance**
- no normal launch/config mismatch path depends on `expect`/`unreachable!`
- failed driver launch is bounded and observable
- enable-before-spawn behavior is either removed or explicitly justified and logged
- no normal launch/config mismatch path depends on `expect`/`unreachable!` in main.rs/driver_handler.rs
- failed driver launch is bounded and observable
- enable-before-spawn behavior is explicitly logged with interrupt mode info
**Verification**
@@ -618,7 +621,16 @@ helper hardening comes before broad driver cleanup, and runtime-proof/observabil
### Wave 2 — Fix `pcid` helper contract
**Primary targets**
**Status**: ✅ Complete
**Implemented**
- irq_helpers.rs: all 9 panic sites converted to io::Error returns; convenience wrappers use log::error + process::exit(1)
- mod.rs: send/recv rewritten to delegate to send_result/recv_result; connect_by_path uses proper error conversion; try_map_bar returns error on null BAR pointer
- bar.rs: 2 unwrap_or_else(panic) → proper error returns
- cap.rs: vendor capability parse failures return errors
- msi.rs: MSI-X BAR mapping null pointer returns MsixMapError::NullPointer
- config.rs: vendor ID parse uses proper error handling
- `recipes/core/base/source/drivers/pcid/src/driver_interface/{bar,cap,config,irq_helpers,msi,mod}.rs`
@@ -642,29 +654,73 @@ helper hardening comes before broad driver cleanup, and runtime-proof/observabil
### Wave 3 — Harden shared `virtio-core` IRQ/MSI-X setup
**Status**: ✅ Complete (all panic-grade sites converted)
**Implemented**
- `spawn_irq_thread`: RawEventQueue creation, event subscription, and event iteration all handle errors gracefully with log::error + thread return instead of unwrap/panic
- MSI-X vector acceptance check in `setup_queue`: converted from assert to log::error + Err return
- MSI-X vector check in `reinit_queue`: converted from assert to log::error
- FEATURES_OK validation in `accept_driver_features`: converted from assert to log::error
- split_virtqueue.rs `ChainBuilder::build`: handles empty chain without panicking
- Mutex lock unwraps retained as idiomatic (mutex poisoning in single-threaded driver is unrecoverable)
- `probe.rs`: vendor-ID assert → `Error::Probe("not a virtio device")`, `expect_mem()``try_mem()` with proper error, three capability `.expect()``.ok_or(Error::InCapable(...))`, `unreachable!()``continue`, zero-multiplier assert → graceful error, MSI-X assert → graceful error
- `arch/x86.rs`: MSI-X feature-info `unreachable!()``Error::Probe(...)`, `read_bsp_apic_id().expect()``.map_err()` with log::warn
- Added `Error::Probe(&'static str)` variant to `transport::Error` for probe-stage failures
- Removed `## Panics` doc section from `probe_device()`
**Primary targets**
- `recipes/core/base/source/drivers/virtio-core/src/{probe,transport,arch/x86}.rs`
**Implement**
- remove assert/expect assumptions around virtio identity, capability presence, and MSI-X setup
- return typed probe/setup failure instead
- emit explicit logs for “virtio present but unsupported/incomplete” states
- make chosen interrupt mode visible to runtime tools and proof scripts
**Acceptance**
- partial or malformed virtio devices fail probe cleanly
- MSI-X setup failure is a bounded bring-up error instead of a crash path
- all 3 downstream consumers (virtio-blkd, virtio-netd, virtio-gpud) compile cleanly against the hardened virtio-core
**Verification**
- `cargo test -p virtio-core`
- re-run virtio-using consumer/runtime tools after the change
- build passes: `CI=1 make r.base CONFIG_NAME=redbear-live-mini ARCH=x86_64`
- downstream consumers compile without errors
### Wave 4 — Convert highest-risk consumers
**Status**: ✅ Complete (all consumer drivers hardened)
**Status**: ✅ Complete (all consumer drivers hardened)
**Current state**: All panic-grade `.expect()`/`.unwrap()`/`assert!()`/`unreachable!()`/`panic!()`
calls have been removed from every consumer driver and converted to graceful error handling.
Additionally, `cfg_access/mod.rs` and `cfg_access/fallback.rs` in pcid were hardened (21+ unwrap
sites converted on the PCIe ECAM/DTB/MCFG startup path). Only Mutex `.lock().unwrap()` calls remain
(idiomatic for single-threaded drivers where poisoning is unrecoverable).
**Hardened drivers**
- `drivers/net/e1000d/` — all panic-grade calls converted
- `drivers/net/rtl8168d/` — all panic-grade calls converted
- `drivers/net/rtl8139d/` — all panic-grade calls converted
- `drivers/net/ixgbed/` — all panic-grade calls converted (+ missing `log` dep added to Cargo.toml)
- `drivers/storage/ahcid/` — all panic-grade calls converted
- `drivers/storage/nvmed/` — all panic-grade calls converted
- `drivers/storage/ided/` — all panic-grade calls converted
- `drivers/virtio/virtio-blkd/` — all panic-grade calls converted
- `drivers/virtio/virtio-netd/` — all panic-grade calls converted
- `drivers/graphics/virtio-gpud/` — all panic-grade calls converted
- `drivers/pcid/src/cfg_access/` — DTB/ACPI parsing hardened (21+ sites)
**Conversion pattern applied**
- Startup fatal (scheme registration, event queue, namespace, port I/O): `log::error! + process::exit(1)`
- Device init failures (DMA alloc, queue creation, BAR mapping): `log::error! + process::exit(1)`
- Runtime event loop errors (IRQ read/write, scheme tick, event next): `log::error! + continue`
- `try_into().unwrap()` on physical addresses: `.map_err(|_| Error::new(EIO))?`
- `try_into().unwrap_or_else(|_| unreachable!())` on fixed-size arrays: `.map_err(|_| Error::new(EIO))?`
- `Dma::new().unwrap()` in scheme code: `match` with `log::error + return` or `log::error + process::exit(1)`
- `assert!()` on MMIO bounds / internal invariants: `debug_assert!()`
- `unreachable!()` in enum fallback arms: last valid variant as fallback
**Primary consumer batches**
- network/storage/audio: `rtl8168d`, `rtl8139d`, `ixgbed`, `ahcid`, `nvmed`, `virtio-blkd`,
@@ -684,11 +740,20 @@ helper hardening comes before broad driver cleanup, and runtime-proof/observabil
**Verification**
- per-driver crate tests where available
- existing bounded proof scripts still pass after the helper and consumer migration
- `CI=1 make cr.base CONFIG_NAME=redbear-live-mini ARCH=x86_64` — zero errors, build successful
- per-driver grep verified zero remaining panic-grade calls (only Mutex `.lock().unwrap()` kept)
### Wave 5 — Improve observability and proof
**Status**: ✅ Complete
**Implemented**
- `lspci` now shows runtime interrupt mode per device by reading `/tmp/redbear-irq-report/*.env`
- `redbear-info` already reports aggregate per-mode PCI IRQ counts and quirk pressure
- `redbear-phase-pci-irq-check` reads live driver interrupt mode reports
- operators can now answer "what IRQ mode is this device using and why?" from a single `lspci` invocation
**Primary targets**
- `local/recipes/system/redbear-info/source/src/main.rs`
@@ -717,6 +782,8 @@ helper hardening comes before broad driver cleanup, and runtime-proof/observabil
### Wave 6 — Docs and hardware-evidence sync
**Status**: ✅ Complete (this document updated with wave completion status)
**Primary targets**
- `README.md`