13ac42b218
Archived: IOMMU-SPEC, KERNEL-IPC, KERNEL-SCHEDULER, PROFILE-MATRIX, QUIRKS-IMPROVEMENT, RELIBC-IPC, repo-governance, SCHEDULER-REVIEW, SCRIPT-BEHAVIOR, USB-VALIDATION, XHCID-DEVICE-IMPROVEMENT. Active: all implementation plans + 3 audits + governance docs.
361 lines
11 KiB
Markdown
361 lines
11 KiB
Markdown
# xhcid Device-Level Improvement Plan
|
|
|
|
## Purpose
|
|
|
|
This document defines the implementation sequence for hardening `xhcid` at the device level in
|
|
Red Bear OS.
|
|
|
|
It is a focused companion to `local/docs/USB-IMPLEMENTATION-PLAN.md`. The USB plan remains the
|
|
subsystem-wide authority; this document narrows scope to the `xhcid` device lifecycle,
|
|
configuration, teardown, PM behavior, enumerator robustness, and bounded proof coverage.
|
|
|
|
## Scope
|
|
|
|
In scope:
|
|
|
|
- `recipes/core/base/source/drivers/usb/xhcid/src/xhci/device_enumerator.rs`
|
|
- `recipes/core/base/source/drivers/usb/xhcid/src/xhci/mod.rs`
|
|
- `recipes/core/base/source/drivers/usb/xhcid/src/xhci/scheme.rs`
|
|
- `recipes/core/base/source/drivers/usb/xhcid/src/xhci/irq_reactor.rs`
|
|
- bounded QEMU validation scripts under `local/scripts/`
|
|
- canonical USB documentation under `local/docs/`
|
|
|
|
Out of scope:
|
|
|
|
- generic USB redesign
|
|
- unrelated class-driver feature work
|
|
- hardware-validation claims beyond what the repo can currently prove
|
|
|
|
## Repo-Fit Note
|
|
|
|
Technical implementation targets live in upstream-owned source under
|
|
`recipes/core/base/source/...`, but durable Red Bear preservation belongs in
|
|
`local/patches/base/`. This plan names the technical work locations, not a recommendation to leave
|
|
work stranded only in upstream-owned trees.
|
|
|
|
## Current Audited Findings
|
|
|
|
The current `xhcid` tree has already improved materially:
|
|
|
|
- lifecycle gating exists through `PortLifecycle` and `PortOperationGuard`
|
|
- `configure_endpoints_once()` is now transactional relative to earlier behavior
|
|
- detach waits before removing published state
|
|
- a bounded QEMU lifecycle proof exists
|
|
|
|
Remaining risks:
|
|
|
|
- partial attach visibility still exists around publication timing
|
|
- detach can still depend on bounded-but-incomplete purge semantics
|
|
- suspend/resume is still mainly software gating
|
|
- rollback failure is not yet a fully hardened degraded-state path
|
|
- enumerator logic still relies on timing- and assumption-heavy behavior
|
|
- proof coverage is still QEMU-bounded and misses key interleavings
|
|
|
|
## Design Invariants
|
|
|
|
The implementation should satisfy these invariants:
|
|
|
|
1. No half-attached device is publicly usable.
|
|
2. No new work is admitted after detach begins.
|
|
3. Detach always reaches a bounded terminal outcome.
|
|
4. Failed configure leaves either the old config intact or the device explicitly
|
|
degraded/reset-required.
|
|
5. PM transitions reflect actual usable state, not only software policy.
|
|
6. Enumerator behavior is bounded and diagnosable, not panic-driven.
|
|
7. Validation claims match what scripts actually prove.
|
|
|
|
## Phase 1 — Proof-First Expansion
|
|
|
|
### Goal
|
|
|
|
Make the current blind spots reproducible before changing behavior.
|
|
|
|
### Work
|
|
|
|
- extend `test-xhci-device-lifecycle-qemu.sh`
|
|
- extend `test-usb-qemu.sh`
|
|
- extend `test-xhci-irq-qemu.sh`
|
|
- add bounded injection hooks in `xhcid` for configure-failure and attach/detach timing cases
|
|
|
|
### Required Cases
|
|
|
|
- repeated attach/detach
|
|
- detach during storage startup
|
|
- transfer-during-detach surrogate
|
|
- configure failure injection
|
|
- suspend/resume admission checks
|
|
- rapid event ordering cases
|
|
|
|
### Per-File Focus
|
|
|
|
#### `local/scripts/test-xhci-device-lifecycle-qemu.sh`
|
|
|
|
- add repeated HID/storage attach-detach loops
|
|
- add detach-during-driver-start for storage
|
|
- add storage attach long enough to exercise startup/read activity before unplug
|
|
- require explicit attach-entered, attach-finished, detach-completed evidence
|
|
|
|
#### `local/scripts/test-usb-qemu.sh`
|
|
|
|
- separate boot progress from proof failure
|
|
- keep result lines distinct for xHCI init, HID spawn, SCSI spawn, bounded readback, and crash scan
|
|
- add repeated full-stack run mode or bounded loop count if needed for ordering-sensitive regressions
|
|
|
|
#### `local/scripts/test-xhci-irq-qemu.sh`
|
|
|
|
- verify interrupt-mode evidence still holds under actual attached-device pressure, not only empty-controller boot
|
|
|
|
#### `xhci` test hooks
|
|
|
|
- add bounded test-only failure hooks in `scheme.rs` / `mod.rs` for:
|
|
- fail after `CONFIGURE_ENDPOINT`
|
|
- fail after `SET_CONFIGURATION`
|
|
- optional delay before final attach commit
|
|
- current bounded implementation uses one-shot guest-side commands written to
|
|
`/tmp/xhcid-test-hook`, consumed by `xhcid` on the next matching lifecycle point
|
|
|
|
### Exit Criteria
|
|
|
|
- scripts are syntax-clean
|
|
- new cases fail meaningfully on current gaps
|
|
- failures identify the specific missed milestone
|
|
|
|
## Phase 2 — Atomic Attach Publication
|
|
|
|
### Goal
|
|
|
|
Prevent half-built devices from becoming publicly reachable.
|
|
|
|
### Work
|
|
|
|
- refactor `Xhci::attach_device`
|
|
- split attach staging from published `PortState`
|
|
- narrow lifecycle exposure so scheme paths cannot reach a device before final commit
|
|
- make attach cleanup direct for prepublication failure
|
|
|
|
### Key Targets
|
|
|
|
- `xhci/mod.rs::Xhci::attach_device`
|
|
- `xhci/mod.rs::PortLifecycle::*`
|
|
- `xhci/device_enumerator.rs::DeviceEnumerator::run`
|
|
|
|
### Per-File Focus
|
|
|
|
#### `xhci/mod.rs`
|
|
|
|
- stop inserting into `port_states` before all attach substeps complete
|
|
- keep slot, input context, EP0 ring, quirks, and descriptors in a private staging carrier
|
|
- commit published `PortState` in one final block
|
|
- keep prepublication cleanup separate from `detach_device()` where possible
|
|
|
|
#### `xhci/device_enumerator.rs`
|
|
|
|
- ensure duplicate connect handling still treats `EAGAIN` or equivalent as "already published" rather than "half-built staging state"
|
|
|
|
### Exit Criteria
|
|
|
|
- no public state before attach commit
|
|
- attach failure leaves no published device and no child driver
|
|
|
|
## Phase 3 — Bounded Detach and Purge
|
|
|
|
### Goal
|
|
|
|
Make teardown bounded, dominant, and safe against stale completions.
|
|
|
|
### Work
|
|
|
|
- bound `PortLifecycle::begin_detaching()`
|
|
- reject all new work immediately once detach starts
|
|
- purge or tombstone pending transfer/reactor state
|
|
- separate graceful drain from forced teardown
|
|
- preserve correct slot-disable/remove ordering
|
|
- ensure child-driver shutdown cannot wedge detach
|
|
|
|
### Key Targets
|
|
|
|
- `xhci/mod.rs`
|
|
- `xhci/irq_reactor.rs`
|
|
- transfer bookkeeping in `xhci/scheme.rs`
|
|
|
|
### Per-File Focus
|
|
|
|
#### `xhci/mod.rs`
|
|
|
|
- add timeout or bounded wait to detach drain logic
|
|
- distinguish graceful drain from forced teardown
|
|
- keep `port_states.remove(...)` after terminal teardown outcome
|
|
|
|
#### `xhci/irq_reactor.rs`
|
|
|
|
- add per-port invalidation or tombstone behavior so stale completions cannot target removed state
|
|
|
|
#### `xhci/scheme.rs`
|
|
|
|
- ensure operation-entry helpers fail immediately once detach starts
|
|
|
|
### Exit Criteria
|
|
|
|
- detach cannot hang forever
|
|
- no stale completion can target removed device state
|
|
- unload-under-activity proof passes
|
|
|
|
## Phase 4 — Configure Rollback Hardening
|
|
|
|
### Goal
|
|
|
|
Make configuration changes fully transactional and recoverable.
|
|
|
|
### Work
|
|
|
|
- formalize stage/program/commit boundaries
|
|
- ensure snapshots cover all mutated controller-facing state
|
|
- promote rollback failure into explicit degraded-state handling
|
|
- define deterministic behavior for post-`SET_CONFIGURATION` failure
|
|
- keep alternate/config bookkeeping coherent after rollback
|
|
- quarantine or reset on unrecoverable ambiguity
|
|
|
|
### Key Targets
|
|
|
|
- `xhci/scheme.rs::configure_endpoints_once`
|
|
- `restore_configure_input_context`
|
|
- `configure_endpoints`
|
|
- `set_configuration`
|
|
- `set_interface`
|
|
|
|
### Per-File Focus
|
|
|
|
#### `xhci/scheme.rs`
|
|
|
|
- keep endpoint/ring state staged until commit
|
|
- verify snapshots cover every mutated slot/endpoint field
|
|
- treat rollback failure as a first-class degraded state
|
|
- ensure post-failure descriptor and alternate bookkeeping still reflect live state
|
|
|
|
### Exit Criteria
|
|
|
|
- injected configure failure preserves old state or explicitly degrades/resets device
|
|
- no staged endpoint state leaks into live software state
|
|
|
|
## Phase 5 — Real PM Sequencing
|
|
|
|
### Goal
|
|
|
|
Replace software-only PM gating with meaningful quiesce/resume semantics.
|
|
|
|
### Work
|
|
|
|
- define richer PM transition states
|
|
- quiesce before suspend
|
|
- tie resume to controller/device validity
|
|
- define PM interaction with detach
|
|
- define PM interaction with configure
|
|
- add bounded PM proof cases
|
|
|
|
### Key Targets
|
|
|
|
- `xhci/scheme.rs::suspend_device`
|
|
- `xhci/scheme.rs::resume_device`
|
|
- `xhci/scheme.rs::ensure_port_active`
|
|
- supporting helpers in `xhci/mod.rs`
|
|
|
|
### Exit Criteria
|
|
|
|
- suspend blocks new I/O only after quiesce starts
|
|
- resume only returns success from a genuinely usable state
|
|
- PM/detach/configure interleavings are deterministic
|
|
|
|
## Phase 6 — Enumerator Cleanup and Timing Hardening
|
|
|
|
### Goal
|
|
|
|
Remove panic-style and magic-delay behavior from the enumerator path.
|
|
|
|
### Work
|
|
|
|
- remove panic-class assumptions from `DeviceEnumerator::run`
|
|
- replace fixed sleeps with bounded readiness checks
|
|
- make duplicate/out-of-order event handling explicit
|
|
- align enumerator decisions with the new attach/detach state machine
|
|
- improve logging for reset/attach/detach milestones
|
|
|
|
### Key Targets
|
|
|
|
- `xhci/device_enumerator.rs`
|
|
- supporting interactions in `xhci/mod.rs`
|
|
|
|
### Exit Criteria
|
|
|
|
- no ordinary event path panics
|
|
- no unnecessary fixed sleep remains
|
|
- rapid event-order tests pass in QEMU
|
|
|
|
## Phase 7 — Final Validation, Docs, and Preservation
|
|
|
|
### Goal
|
|
|
|
Close the loop with evidence, canonical docs, and durable patch carriers.
|
|
|
|
### Work
|
|
|
|
- rerun the full bounded proof matrix on a rebuilt image
|
|
- run source-level verification (`lsp_diagnostics`, `cargo check`, `cargo test`)
|
|
- update canonical docs:
|
|
- `local/docs/USB-IMPLEMENTATION-PLAN.md`
|
|
- `local/docs/USB-VALIDATION-RUNBOOK.md`
|
|
- immutable archived durable patch carriers under `local/patches/base/`
|
|
- delete only clearly stale, superseded docs after link sweep
|
|
|
|
### Exit Criteria
|
|
|
|
- all bounded USB/xHCI proofs pass on a fresh image
|
|
- changed files are diagnostics-clean
|
|
- canonical docs match actual proof scope
|
|
- patch carrier is immutable archived and reapplicable
|
|
|
|
## Validation Matrix
|
|
|
|
Required final proofs:
|
|
|
|
- `bash ./local/scripts/test-xhci-device-lifecycle-qemu.sh --check <tracked-target>`
|
|
- `bash ./local/scripts/test-usb-qemu.sh --check <tracked-target>`
|
|
- `bash ./local/scripts/test-xhci-irq-qemu.sh --check`
|
|
- `bash ./local/scripts/test-usb-maturity-qemu.sh <tracked-target>`
|
|
|
|
Required source checks:
|
|
|
|
- `lsp_diagnostics` on all changed files
|
|
- `cargo check` / `cargo test` for `xhcid`
|
|
- `cargo check` for any touched class daemon or helper crate
|
|
|
|
## Commit Strategy
|
|
|
|
1. proof/harness expansion
|
|
2. atomic attach publication
|
|
3. bounded detach and purge
|
|
4. configure rollback hardening
|
|
5. PM sequencing
|
|
6. enumerator cleanup
|
|
7. docs, patch preservation, stale-doc cleanup
|
|
|
|
## Canonical Doc Authority
|
|
|
|
Authoritative docs after cleanup:
|
|
|
|
- `local/docs/USB-IMPLEMENTATION-PLAN.md`
|
|
- `local/docs/USB-VALIDATION-RUNBOOK.md`
|
|
|
|
This xhcid plan is a focused implementation document beneath those subsystem-level authorities.
|
|
|
|
## Completion Standard
|
|
|
|
This work is complete only when:
|
|
|
|
- all seven phases are done in order
|
|
- no changed-file diagnostics remain
|
|
- `xhcid` builds/tests cleanly
|
|
- bounded QEMU proof matrix passes on a rebuilt image
|
|
- canonical docs are synchronized
|
|
- durable patch carrier is immutable archived
|
|
- remaining gaps, if any, are explicitly documented as future or hardware-only work
|