RedBear-OS/local/docs/archived/XHCID-DEVICE-IMPROVEMENT-PLAN.md
vasilito 13ac42b218 docs: final stale doc cleanup — 22 archived, 18 active
Archived: IOMMU-SPEC, KERNEL-IPC, KERNEL-SCHEDULER, PROFILE-MATRIX,
QUIRKS-IMPROVEMENT, RELIBC-IPC, repo-governance, SCHEDULER-REVIEW,
SCRIPT-BEHAVIOR, USB-VALIDATION, XHCID-DEVICE-IMPROVEMENT.

Active: all implementation plans + 3 audits + governance docs.
2026-05-03 16:26:13 +01:00

# xhcid Device-Level Improvement Plan
## Purpose
This document defines the implementation sequence for hardening `xhcid` at the device level in
Red Bear OS.
It is a focused companion to `local/docs/USB-IMPLEMENTATION-PLAN.md`. The USB plan remains the
subsystem-wide authority; this document narrows scope to the `xhcid` device lifecycle,
configuration, teardown, PM behavior, enumerator robustness, and bounded proof coverage.
## Scope
In scope:
- `recipes/core/base/source/drivers/usb/xhcid/src/xhci/device_enumerator.rs`
- `recipes/core/base/source/drivers/usb/xhcid/src/xhci/mod.rs`
- `recipes/core/base/source/drivers/usb/xhcid/src/xhci/scheme.rs`
- `recipes/core/base/source/drivers/usb/xhcid/src/xhci/irq_reactor.rs`
- bounded QEMU validation scripts under `local/scripts/`
- canonical USB documentation under `local/docs/`
Out of scope:
- generic USB redesign
- unrelated class-driver feature work
- hardware-validation claims beyond what the repo can currently prove
## Repo-Fit Note
Technical implementation targets live in upstream-owned source under
`recipes/core/base/source/...`, but durable Red Bear preservation belongs in
`local/patches/base/`. This plan names the technical work locations, not a recommendation to leave
work stranded only in upstream-owned trees.
## Current Audited Findings
The current `xhcid` tree has already improved materially:
- lifecycle gating exists through `PortLifecycle` and `PortOperationGuard`
- `configure_endpoints_once()` is more transactional than in earlier revisions
- detach waits before removing published state
- a bounded QEMU lifecycle proof exists
Remaining risks:
- partial attach visibility still exists around publication timing
- detach can still depend on bounded-but-incomplete purge semantics
- suspend/resume is still mainly software gating
- rollback failure is not yet a fully hardened degraded-state path
- enumerator logic still relies on timing- and assumption-heavy behavior
- proof coverage is still QEMU-bounded and misses key interleavings
## Design Invariants
The implementation should satisfy these invariants:
1. No half-attached device is publicly usable.
2. No new work is admitted after detach begins.
3. Detach always reaches a bounded terminal outcome.
4. Failed configure leaves either the old config intact or the device explicitly
degraded/reset-required.
5. PM transitions reflect actual usable state, not only software policy.
6. Enumerator behavior is bounded and diagnosable, not panic-driven.
7. Validation claims match what scripts actually prove.
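
Invariants 1 and 2 can be illustrated with a small admission gate. This is a minimal sketch under stated assumptions: `SketchLifecycle`, `SketchState`, and `SketchGuard` are hypothetical names invented for this example, and the real `PortLifecycle` and `PortOperationGuard` in `xhcid` may be shaped quite differently.

```rust
use std::sync::Mutex;

/// Hypothetical lifecycle states; not the real `PortLifecycle` definition.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SketchState {
    Attaching,
    Attached,
    Detaching,
}

/// Tracks the state and the number of in-flight operations together,
/// so admission and counting happen under one lock.
pub struct SketchLifecycle {
    inner: Mutex<(SketchState, usize)>,
}

/// RAII guard: while it is alive, one operation counts as in flight.
pub struct SketchGuard<'a> {
    owner: &'a SketchLifecycle,
}

impl SketchLifecycle {
    pub fn new() -> Self {
        Self { inner: Mutex::new((SketchState::Attaching, 0)) }
    }

    /// Publish the device. Only valid from `Attaching`.
    pub fn commit_attach(&self) -> bool {
        let mut g = self.inner.lock().unwrap();
        if g.0 == SketchState::Attaching {
            g.0 = SketchState::Attached;
            true
        } else {
            false
        }
    }

    /// Admit work only while fully attached (invariants 1 and 2).
    pub fn try_begin_op(&self) -> Option<SketchGuard<'_>> {
        let mut g = self.inner.lock().unwrap();
        if g.0 != SketchState::Attached {
            return None;
        }
        g.1 += 1;
        Some(SketchGuard { owner: self })
    }

    /// From this point on, `try_begin_op` rejects all new work.
    pub fn begin_detach(&self) {
        self.inner.lock().unwrap().0 = SketchState::Detaching;
    }

    pub fn in_flight(&self) -> usize {
        self.inner.lock().unwrap().1
    }
}

impl Drop for SketchGuard<'_> {
    fn drop(&mut self) {
        self.owner.inner.lock().unwrap().1 -= 1;
    }
}
```

Operations admitted before `begin_detach()` keep their guard and finish normally; only new admissions are refused, which is what lets detach drain to a bounded outcome.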
## Phase 1 — Proof-First Expansion
### Goal
Make the current blind spots reproducible before changing behavior.
### Work
- extend `test-xhci-device-lifecycle-qemu.sh`
- extend `test-usb-qemu.sh`
- extend `test-xhci-irq-qemu.sh`
- add bounded injection hooks in `xhcid` for configure-failure and attach/detach timing cases
### Required Cases
- repeated attach/detach
- detach during storage startup
- transfer-during-detach surrogate
- configure failure injection
- suspend/resume admission checks
- rapid event ordering cases
### Per-File Focus
#### `local/scripts/test-xhci-device-lifecycle-qemu.sh`
- add repeated HID/storage attach-detach loops
- add detach-during-driver-start for storage
- add storage attach long enough to exercise startup/read activity before unplug
- require explicit attach-entered, attach-finished, detach-completed evidence
#### `local/scripts/test-usb-qemu.sh`
- separate boot progress from proof failure
- keep result lines distinct for xHCI init, HID spawn, SCSI spawn, bounded readback, and crash scan
- add repeated full-stack run mode or bounded loop count if needed for ordering-sensitive regressions
#### `local/scripts/test-xhci-irq-qemu.sh`
- verify interrupt-mode evidence still holds under actual attached-device pressure, not only empty-controller boot
#### `xhci` test hooks
- add bounded test-only failure hooks in `scheme.rs` / `mod.rs` for:
  - fail after `CONFIGURE_ENDPOINT`
  - fail after `SET_CONFIGURATION`
  - optional delay before final attach commit
- current bounded implementation uses one-shot guest-side commands written to
  `/tmp/xhcid-test-hook`, consumed by `xhcid` on the next matching lifecycle point
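
The one-shot hook file described above could be consumed along these lines. This is a sketch only: the command strings (`fail-after-configure-endpoint` and so on), the `TestHook` enum, and the `take_hook_if` helper are assumptions made for illustration, since the plan does not specify the hook file format.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical hook commands; the real command strings are not specified.
#[derive(Debug, PartialEq, Eq)]
pub enum TestHook {
    FailAfterConfigureEndpoint,
    FailAfterSetConfiguration,
    DelayBeforeAttachCommitMs(u64),
}

/// Parse one command. Unknown input yields `None` instead of panicking.
pub fn parse_hook(text: &str) -> Option<TestHook> {
    let line = text.trim();
    match line {
        "fail-after-configure-endpoint" => Some(TestHook::FailAfterConfigureEndpoint),
        "fail-after-set-configuration" => Some(TestHook::FailAfterSetConfiguration),
        _ => line
            .strip_prefix("delay-before-attach-commit-ms=")
            .and_then(|ms| ms.parse::<u64>().ok())
            .map(TestHook::DelayBeforeAttachCommitMs),
    }
}

/// Consume the hook if `wanted` accepts it. The file is removed on a match,
/// so the hook fires at most once.
pub fn take_hook_if(
    path: &Path,
    wanted: impl Fn(&TestHook) -> bool,
) -> io::Result<Option<TestHook>> {
    let text = match fs::read_to_string(path) {
        Ok(text) => text,
        // No hook armed is the normal case, not an error.
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(None),
        Err(e) => return Err(e),
    };
    match parse_hook(&text) {
        Some(hook) if wanted(&hook) => {
            fs::remove_file(path)?;
            Ok(Some(hook))
        }
        // Not this lifecycle point (or unparseable): leave the file in place.
        _ => Ok(None),
    }
}
```

Removing the file only on a match keeps the hook armed until the lifecycle point it targets is actually reached, which fits the "next matching lifecycle point" wording.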
### Exit Criteria
- scripts are syntax-clean
- new cases fail meaningfully on current gaps
- failures identify the specific missed milestone
## Phase 2 — Atomic Attach Publication
### Goal
Prevent half-built devices from becoming publicly reachable.
### Work
- refactor `Xhci::attach_device`
- split attach staging from published `PortState`
- narrow lifecycle exposure so scheme paths cannot reach a device before final commit
- make attach cleanup direct for prepublication failure
### Key Targets
- `xhci/mod.rs::Xhci::attach_device`
- `xhci/mod.rs::PortLifecycle::*`
- `xhci/device_enumerator.rs::DeviceEnumerator::run`
### Per-File Focus
#### `xhci/mod.rs`
- stop inserting into `port_states` before all attach substeps complete
- keep slot, input context, EP0 ring, quirks, and descriptors in a private staging carrier
- commit published `PortState` in one final block
- keep prepublication cleanup separate from `detach_device()` where possible
#### `xhci/device_enumerator.rs`
- ensure duplicate connect handling still treats `EAGAIN` or equivalent as "already published" rather than "half-built staging state"
### Exit Criteria
- no public state before attach commit
- attach failure leaves no published device and no child driver
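
The staging-then-commit shape of this phase can be sketched as follows. Everything here is illustrative: `StagedAttach`, `PublishedPort`, `Ports`, and the step names are hypothetical stand-ins, not the real `PortState` or `Xhci::attach_device` code, and `AlreadyPublished` merely plays the role of the `EAGAIN`-style duplicate-connect result.

```rust
use std::collections::BTreeMap;

/// Hypothetical private carrier for everything built before publication.
#[derive(Debug)]
pub struct StagedAttach {
    pub slot: u8,
    pub ep0_ring_ready: bool,
    pub descriptors_read: bool,
}

/// Hypothetical published state, standing in for the real `PortState`.
#[derive(Debug, PartialEq, Eq)]
pub struct PublishedPort {
    pub slot: u8,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AttachError {
    AlreadyPublished,
    StepFailed(&'static str),
}

#[derive(Default)]
pub struct Ports {
    published: BTreeMap<u32, PublishedPort>,
    /// Slots handed back by prepublication cleanup.
    pub released_slots: Vec<u8>,
}

impl Ports {
    /// `fail_at` injects a failure at the named step, for testing.
    pub fn attach(
        &mut self,
        port: u32,
        slot: u8,
        fail_at: Option<&'static str>,
    ) -> Result<(), AttachError> {
        if self.published.contains_key(&port) {
            return Err(AttachError::AlreadyPublished);
        }
        // Build privately: nothing below is reachable through `published`.
        let mut staged = StagedAttach { slot, ep0_ring_ready: false, descriptors_read: false };
        match Self::run_steps(&mut staged, fail_at) {
            Ok(()) => {
                // Single commit point: the device becomes visible here.
                self.published.insert(port, PublishedPort { slot: staged.slot });
                Ok(())
            }
            Err(e) => {
                // Direct prepublication cleanup, with no detach path involved.
                self.released_slots.push(staged.slot);
                Err(e)
            }
        }
    }

    fn run_steps(staged: &mut StagedAttach, fail_at: Option<&'static str>) -> Result<(), AttachError> {
        if fail_at == Some("ep0-ring") {
            return Err(AttachError::StepFailed("ep0-ring"));
        }
        staged.ep0_ring_ready = true;
        if fail_at == Some("descriptors") {
            return Err(AttachError::StepFailed("descriptors"));
        }
        staged.descriptors_read = true;
        Ok(())
    }

    pub fn is_published(&self, port: u32) -> bool {
        self.published.contains_key(&port)
    }
}
```

Because the map is written in exactly one place, a lookup can only ever observe "absent" or "fully built", which is the property the exit criteria ask for.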
## Phase 3 — Bounded Detach and Purge
### Goal
Make teardown bounded, dominant, and safe against stale completions.
### Work
- bound `PortLifecycle::begin_detaching()`
- reject all new work immediately once detach starts
- purge or tombstone pending transfer/reactor state
- separate graceful drain from forced teardown
- preserve correct slot-disable/remove ordering
- ensure child-driver shutdown cannot wedge detach
### Key Targets
- `xhci/mod.rs`
- `xhci/irq_reactor.rs`
- transfer bookkeeping in `xhci/scheme.rs`
### Per-File Focus
#### `xhci/mod.rs`
- add timeout or bounded wait to detach drain logic
- distinguish graceful drain from forced teardown
- keep `port_states.remove(...)` after terminal teardown outcome
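
A bounded drain that distinguishes graceful from forced teardown might look like the sketch below. `DrainGate` and `DetachOutcome` are hypothetical names for illustration, and the sketch uses `std` synchronization primitives rather than whatever `xhcid` actually uses.

```rust
use std::sync::{Condvar, Mutex};
use std::time::{Duration, Instant};

/// Terminal outcomes of a detach drain.
#[derive(Debug, PartialEq, Eq)]
pub enum DetachOutcome {
    /// All in-flight operations finished within the limit.
    Graceful,
    /// The limit expired; the caller must tear down forcibly.
    Forced { abandoned_ops: usize },
}

#[derive(Default)]
pub struct DrainGate {
    in_flight: Mutex<usize>,
    idle: Condvar,
}

impl DrainGate {
    pub fn op_started(&self) {
        *self.in_flight.lock().unwrap() += 1;
    }

    pub fn op_finished(&self) {
        let mut n = self.in_flight.lock().unwrap();
        *n -= 1;
        if *n == 0 {
            self.idle.notify_all();
        }
    }

    /// Wait for in-flight operations to drain, but never longer than `limit`.
    pub fn drain(&self, limit: Duration) -> DetachOutcome {
        let deadline = Instant::now() + limit;
        let mut n = self.in_flight.lock().unwrap();
        while *n > 0 {
            let remaining = deadline.saturating_duration_since(Instant::now());
            if remaining.is_zero() {
                return DetachOutcome::Forced { abandoned_ops: *n };
            }
            // Re-check the count after every wakeup, spurious or not.
            n = self.idle.wait_timeout(n, remaining).unwrap().0;
        }
        DetachOutcome::Graceful
    }
}
```

Either outcome is terminal, so the published-state removal can follow unconditionally; only the cleanup path taken beforehand differs.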
#### `xhci/irq_reactor.rs`
- add per-port invalidation or tombstone behavior so stale completions cannot target removed state
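
One common way to get tombstone behavior is a per-port generation number that every pending completion carries. The sketch below assumes that technique; `CompletionRouter` and `PortTicket` are hypothetical names and do not describe the existing reactor code.

```rust
use std::collections::HashMap;

/// Identifies the exact device instance a transfer was issued against.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PortTicket {
    pub port: u32,
    pub generation: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Delivery {
    Delivered,
    DroppedStale,
}

#[derive(Default)]
pub struct CompletionRouter {
    /// port -> generation of the currently published device instance
    live: HashMap<u32, u64>,
    next_generation: u64,
}

impl CompletionRouter {
    /// Called at attach commit; returns the ticket new transfers should carry.
    pub fn publish(&mut self, port: u32) -> PortTicket {
        self.next_generation += 1;
        self.live.insert(port, self.next_generation);
        PortTicket { port, generation: self.next_generation }
    }

    /// Called at detach: completions for the old instance become stale.
    pub fn tombstone(&mut self, port: u32) {
        self.live.remove(&port);
    }

    /// A completion is delivered only if its ticket matches the live instance.
    pub fn deliver(&self, ticket: PortTicket) -> Delivery {
        match self.live.get(&ticket.port) {
            Some(&generation) if generation == ticket.generation => Delivery::Delivered,
            _ => Delivery::DroppedStale,
        }
    }
}
```

The generation check also covers the replug case: a completion issued against the old device on a port is dropped even after a new device has been published on the same port.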
#### `xhci/scheme.rs`
- ensure operation-entry helpers fail immediately once detach starts
### Exit Criteria
- detach cannot hang forever
- no stale completion can target removed device state
- unload-under-activity proof passes
## Phase 4 — Configure Rollback Hardening
### Goal
Make configuration changes fully transactional and recoverable.
### Work
- formalize stage/program/commit boundaries
- ensure snapshots cover all mutated controller-facing state
- promote rollback failure into explicit degraded-state handling
- define deterministic behavior for post-`SET_CONFIGURATION` failure
- keep alternate/config bookkeeping coherent after rollback
- quarantine or reset on unrecoverable ambiguity
### Key Targets
- `xhci/scheme.rs::configure_endpoints_once`
- `restore_configure_input_context`
- `configure_endpoints`
- `set_configuration`
- `set_interface`
### Per-File Focus
#### `xhci/scheme.rs`
- keep endpoint/ring state staged until commit
- verify snapshots cover every mutated slot/endpoint field
- treat rollback failure as a first-class degraded state
- ensure post-failure descriptor and alternate bookkeeping still reflect live state
### Exit Criteria
- injected configure failure preserves old state or explicitly degrades/resets device
- no staged endpoint state leaks into live software state
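
The stage/program/commit boundary and the degraded-state path can be sketched as below. The `Controller` trait, `Device`, `EndpointConfig`, and `ScriptedController` are all hypothetical simplifications; the real `configure_endpoints_once` deals with input contexts and rings, not a single `program` call.

```rust
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct EndpointConfig {
    pub configuration: u8,
    pub endpoints: Vec<u8>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DeviceHealth {
    Healthy,
    ResetRequired,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ConfigureError {
    /// The new config failed, but the old one was restored.
    RolledBack,
    /// Restoring the old config also failed; controller state is ambiguous.
    Degraded,
}

#[derive(Clone, Copy, Debug)]
pub struct ProgramError;

/// Hypothetical stand-in for the controller-facing command sequence.
pub trait Controller {
    fn program(&mut self, cfg: &EndpointConfig) -> Result<(), ProgramError>;
}

pub struct Device {
    pub live: EndpointConfig,
    pub health: DeviceHealth,
}

impl Device {
    pub fn configure(
        &mut self,
        ctl: &mut dyn Controller,
        wanted: EndpointConfig,
    ) -> Result<(), ConfigureError> {
        // Stage: snapshot the live state before touching the controller.
        let snapshot = self.live.clone();
        if ctl.program(&wanted).is_ok() {
            // Commit: software state changes only after hardware accepted it.
            self.live = wanted;
            return Ok(());
        }
        if ctl.program(&snapshot).is_ok() {
            // Rollback succeeded; `self.live` was never modified.
            Err(ConfigureError::RolledBack)
        } else {
            // Rollback failure is a first-class state, not a silent error.
            self.health = DeviceHealth::ResetRequired;
            Err(ConfigureError::Degraded)
        }
    }
}

/// Test double that returns scripted results in order, then succeeds.
pub struct ScriptedController {
    pub results: Vec<Result<(), ProgramError>>,
}

impl Controller for ScriptedController {
    fn program(&mut self, _cfg: &EndpointConfig) -> Result<(), ProgramError> {
        if self.results.is_empty() { Ok(()) } else { self.results.remove(0) }
    }
}
```

The scripted controller plays the role of the Phase 1 failure-injection hooks: it lets a test force both the rollback path and the rollback-failure path deterministically.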
## Phase 5 — Real PM Sequencing
### Goal
Replace software-only PM gating with meaningful quiesce/resume semantics.
### Work
- define richer PM transition states
- quiesce before suspend
- tie resume to controller/device validity
- define PM interaction with detach
- define PM interaction with configure
- add bounded PM proof cases
### Key Targets
- `xhci/scheme.rs::suspend_device`
- `xhci/scheme.rs::resume_device`
- `xhci/scheme.rs::ensure_port_active`
- supporting helpers in `xhci/mod.rs`
### Exit Criteria
- suspend blocks new I/O only after quiesce starts
- resume only returns success from a genuinely usable state
- PM/detach/configure interleavings are deterministic
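
A richer PM state machine of the kind this phase calls for might be sketched as follows. The state names, `PmPort`, and the `device_usable` flag are assumptions for illustration; the actual `suspend_device` / `resume_device` paths would derive usability from controller and port status rather than a boolean argument.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PmState {
    Active,
    /// Suspend requested: new I/O is refused while in-flight work drains.
    Quiescing,
    Suspended,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PmError {
    InvalidTransition(PmState),
    StillBusy(usize),
    DeviceNotUsable,
}

pub struct PmPort {
    pub state: PmState,
    pub in_flight: usize,
}

impl PmPort {
    /// New I/O is admitted only in `Active`.
    pub fn admits_io(&self) -> bool {
        self.state == PmState::Active
    }

    pub fn begin_suspend(&mut self) -> Result<(), PmError> {
        match self.state {
            PmState::Active => {
                self.state = PmState::Quiescing;
                Ok(())
            }
            other => Err(PmError::InvalidTransition(other)),
        }
    }

    /// Suspend completes only once nothing is in flight.
    pub fn finish_suspend(&mut self) -> Result<(), PmError> {
        match (self.state, self.in_flight) {
            (PmState::Quiescing, 0) => {
                self.state = PmState::Suspended;
                Ok(())
            }
            (PmState::Quiescing, n) => Err(PmError::StillBusy(n)),
            (other, _) => Err(PmError::InvalidTransition(other)),
        }
    }

    /// Resume reports success only when the device is genuinely usable.
    pub fn resume(&mut self, device_usable: bool) -> Result<(), PmError> {
        if self.state != PmState::Suspended {
            return Err(PmError::InvalidTransition(self.state));
        }
        if !device_usable {
            return Err(PmError::DeviceNotUsable);
        }
        self.state = PmState::Active;
        Ok(())
    }
}
```

Making every transition an explicit `match` is what yields deterministic interleavings: a detach or configure request that arrives in `Quiescing` or `Suspended` gets a defined error rather than racing the PM path.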
## Phase 6 — Enumerator Cleanup and Timing Hardening
### Goal
Remove panic-style and magic-delay behavior from the enumerator path.
### Work
- remove panic-class assumptions from `DeviceEnumerator::run`
- replace fixed sleeps with bounded readiness checks
- make duplicate/out-of-order event handling explicit
- align enumerator decisions with the new attach/detach state machine
- improve logging for reset/attach/detach milestones
### Key Targets
- `xhci/device_enumerator.rs`
- supporting interactions in `xhci/mod.rs`
### Exit Criteria
- no ordinary event path panics
- no unnecessary fixed sleep remains
- rapid event-order tests pass in QEMU
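
Replacing a fixed sleep with a bounded readiness check can be as small as the helper below. `wait_until_ready` and `WaitError` are hypothetical names; the `ready` closure stands in for whatever port or controller status read the enumerator would actually poll.

```rust
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum WaitError {
    /// The condition never became true; `polls` aids diagnosis in logs.
    TimedOut { polls: u32 },
}

/// Poll `ready` every `interval` until it returns true or `limit` elapses.
/// Returns the number of polls taken, so callers can log how long it took.
pub fn wait_until_ready(
    mut ready: impl FnMut() -> bool,
    interval: Duration,
    limit: Duration,
) -> Result<u32, WaitError> {
    let deadline = Instant::now() + limit;
    let mut polls = 0u32;
    loop {
        polls += 1;
        if ready() {
            return Ok(polls);
        }
        if Instant::now() >= deadline {
            // Report a diagnosable error instead of panicking or hanging.
            return Err(WaitError::TimedOut { polls });
        }
        thread::sleep(interval);
    }
}
```

Compared with a fixed delay, this returns as soon as the condition holds and turns the "device never became ready" case into an error value the enumerator can log and recover from.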
## Phase 7 — Final Validation, Docs, and Preservation
### Goal
Close the loop with evidence, canonical docs, and durable patch carriers.
### Work
- rerun the full bounded proof matrix on a rebuilt image
- run source-level verification (`lsp_diagnostics`, `cargo check`, `cargo test`)
- update canonical docs:
  - `local/docs/USB-IMPLEMENTATION-PLAN.md`
  - `local/docs/USB-VALIDATION-RUNBOOK.md`
- immutably archive durable patch carriers under `local/patches/base/`
- delete only clearly stale, superseded docs after link sweep
### Exit Criteria
- all bounded USB/xHCI proofs pass on a fresh image
- changed files are diagnostics-clean
- canonical docs match actual proof scope
- patch carrier is immutably archived and reapplicable
## Validation Matrix
Required final proofs:
- `bash ./local/scripts/test-xhci-device-lifecycle-qemu.sh --check <tracked-target>`
- `bash ./local/scripts/test-usb-qemu.sh --check <tracked-target>`
- `bash ./local/scripts/test-xhci-irq-qemu.sh --check`
- `bash ./local/scripts/test-usb-maturity-qemu.sh <tracked-target>`
Required source checks:
- `lsp_diagnostics` on all changed files
- `cargo check` / `cargo test` for `xhcid`
- `cargo check` for any touched class daemon or helper crate
## Commit Strategy
1. proof/harness expansion
2. atomic attach publication
3. bounded detach and purge
4. configure rollback hardening
5. PM sequencing
6. enumerator cleanup
7. docs, patch preservation, stale-doc cleanup
## Canonical Doc Authority
Authoritative docs after cleanup:
- `local/docs/USB-IMPLEMENTATION-PLAN.md`
- `local/docs/USB-VALIDATION-RUNBOOK.md`
This xhcid plan is a focused implementation document beneath those subsystem-level authorities.
## Completion Standard
This work is complete only when:
- all seven phases are done in order
- no changed-file diagnostics remain
- `xhcid` builds/tests cleanly
- bounded QEMU proof matrix passes on a rebuilt image
- canonical docs are synchronized
- durable patch carrier is immutably archived
- remaining gaps, if any, are explicitly documented as future or hardware-only work