# Red Bear OS — CPU/DMA/IRQ/MSI/Scheduler Fix Plan

**Date**: 2026-05-04
**Updated**: 2026-05-04 (MSI T1.1–T2.2 implemented, committed, pushed)
**Status**: Active — MSI Phase 1 complete, DMA/Scheduler pending
**Source of truth**: Linux kernel 7.0 (`local/reference/linux-7.0/`)

## 1. Problem Statement

Five critical integration gaps in the microkernel architecture:

| Gap | Severity | Impact | Status |
|-----|----------|--------|--------|
| MSI absent from kernel | CRITICAL | All NVMe/GPU/NIC on legacy INTx | ✅ RESOLVED (P8-msi.patch) |
| DMA/IOMMU not integrated | CRITICAL | DMA buffers unprotected | ⏳ Pending |
| PIT tick (148 Hz) vs LAPIC (1000 Hz) | HIGH | Scheduler tick ~6.8x slower than Linux | ✅ RESOLVED (P7-scheduler patch) |
| Global scheduler lock | HIGH | Serializes all context switches | ✅ RESOLVED (work-stealing) |
| Thread creation (3 IPC hops) | HIGH | ~3x slower than Linux clone() | ⏳ Pending |

## 2. Phase 1: MSI/MSI-X in Kernel (Week 1-3) ✅ COMPLETE

### T1.1: MSI Capability Parsing ✅ DONE
- File: `kernel/src/arch/x86_shared/device/msi.rs` (61 lines)
- Commit: `678980521` in `P8-msi.patch`
- Linux ref: `arch/x86/kernel/apic/msi.c` (391 lines)
- Implements: `MsiMessage` (compose/validate), `MsiCapability` (parse 32/64-bit), `MsixCapability` (parse table/PBA), `is_valid_msi_address`, `is_valid_msi_vector`
- Bounds-safe: all `parse()` methods return `Option<Self>`, using `.get()` instead of raw indexing (see the sketch below)
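
For illustration, a minimal sketch of the Option-returning parse pattern, assuming the standard PCI MSI capability layout (message control at offset 2, message address at 4, message data at 8 for the 32-bit variant); the struct and method names here are illustrative, not the actual msi.rs API:

```rust
/// Sketch only: bounds-checked parsing that never indexes past the slice.
struct MsiCapability32 {
    message_control: u16,
    message_address: u32,
    message_data: u16,
}

impl MsiCapability32 {
    /// Parse from config space starting at the capability header.
    /// Returns None instead of panicking on short input.
    fn parse(cfg: &[u8]) -> Option<Self> {
        let u16_at = |off: usize| -> Option<u16> {
            Some(u16::from_le_bytes([*cfg.get(off)?, *cfg.get(off + 1)?]))
        };
        let u32_at = |off: usize| -> Option<u32> {
            Some(u32::from_le_bytes([
                *cfg.get(off)?,
                *cfg.get(off + 1)?,
                *cfg.get(off + 2)?,
                *cfg.get(off + 3)?,
            ]))
        };
        Some(Self {
            message_control: u16_at(2)?,
            message_address: u32_at(4)?,
            message_data: u16_at(8)?,
        })
    }
}
```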

### T1.2: Vector Allocation Matrix ✅ DONE
- File: `kernel/src/arch/x86_shared/device/vector.rs` (53 lines)
- Commit: `678980521` in `P8-msi.patch`
- Linux ref: `arch/x86/kernel/apic/vector.c` (1387 lines)
- Implements: per-CPU bitmatrix (7 × 32-bit banks = 224 vectors, 32–255), `allocate_vector`, `free_vector`
- Lock-free CAS-based allocation with `trailing_ones()` find-first-zero (see the sketch below)
- NOTE: VECTORS table is global (not yet per-CPU sharded) — sufficient for 224 vectors
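
A minimal sketch of that allocation scheme, assuming one global array of atomic banks; the names are illustrative, not the vector.rs internals:

```rust
use core::sync::atomic::{AtomicU32, Ordering};

const BANKS: usize = 7; // 7 x 32 bits = vectors 32..=255
const FIRST_VECTOR: u8 = 32;
const EMPTY: AtomicU32 = AtomicU32::new(0);

static VECTORS: [AtomicU32; BANKS] = [EMPTY; BANKS];

/// Find the first zero bit across the banks and claim it with a CAS.
fn allocate_vector() -> Option<u8> {
    for (bank, word) in VECTORS.iter().enumerate() {
        loop {
            let current = word.load(Ordering::Relaxed);
            // trailing_ones() = index of the first free (zero) bit.
            let bit = current.trailing_ones();
            if bit >= 32 {
                break; // bank full, try the next one
            }
            let claimed = current | (1 << bit);
            if word
                .compare_exchange(current, claimed, Ordering::AcqRel, Ordering::Relaxed)
                .is_ok()
            {
                return Some(FIRST_VECTOR + (bank as u8) * 32 + bit as u8);
            }
            // Lost the race; reload this bank and retry.
        }
    }
    None
}

fn free_vector(vector: u8) {
    let idx = (vector - FIRST_VECTOR) as usize;
    VECTORS[idx / 32].fetch_and(!(1 << (idx % 32)), Ordering::AcqRel);
}
```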

### T1.3: MSI IRQ Domain (Scheme Integration) ✅ DONE
- File: `kernel/src/scheme/irq.rs`
- Commit: `678980521` in `P8-msi.patch`
- Implements: `msi_vector_is_valid()` (32–0xEF range check), `iommu_validate_msi_irq()` hook (stub: always true), IOMMU gate at `irq_trigger()` for vectors ≥16 (sketched below)
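
The shape of those checks, as a sketch; the real functions live in irq.rs and the gate function name here is illustrative:

```rust
/// User-programmable MSI vectors live in 32..=0xEF; the rest stay
/// reserved for kernel/IPI use.
fn msi_vector_is_valid(vector: u8) -> bool {
    (32..=0xEF).contains(&vector)
}

/// Stub IOMMU hook: always allows the interrupt today; once interrupt
/// remapping is wired up it would consult the IOMMU driver.
fn iommu_validate_msi_irq(_vector: u8) -> bool {
    true
}

/// Gate applied at trigger time: legacy ISA IRQs (< 16) pass straight
/// through, everything else must clear the IOMMU hook.
fn irq_trigger_allowed(vector: u8) -> bool {
    vector < 16 || iommu_validate_msi_irq(vector)
}
```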

### T1.4: Userspace MSI Consumer (driver-sys) ✅ DONE
- File: `local/recipes/drivers/redox-driver-sys/source/src/irq.rs`
- Commit: `678980521`
- Implements: `MsiAllocation` with round-robin CPU allocation, `irq_set_affinity` (scheme write), `program_x86_message` with kernel-mediated address/vector validation (mask `0xFFF0_0000`); see the sketch below
- Quirk-aware fallback retained: FORCE_LEGACY, NO_MSI, NO_MSIX
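
The validation follows the standard x86 MSI message format (target window `0xFEExxxxx`, destination APIC ID in address bits 19:12, vector in data bits 7:0). A minimal sketch with illustrative function bodies:

```rust
const MSI_ADDRESS_BASE: u32 = 0xFEE0_0000;
const MSI_ADDRESS_MASK: u32 = 0xFFF0_0000;

/// Compose an (address, data) pair: fixed delivery mode, physical
/// destination, edge-triggered, so only the APIC ID and vector vary.
fn compose_x86_msi(dest_apic_id: u8, vector: u8) -> (u32, u32) {
    let address = MSI_ADDRESS_BASE | ((dest_apic_id as u32) << 12);
    let data = vector as u32;
    (address, data)
}

/// Kernel-mediated sanity check: the message must target the LAPIC
/// MMIO window.
fn is_valid_msi_address(address: u32) -> bool {
    address & MSI_ADDRESS_MASK == MSI_ADDRESS_BASE
}

fn is_valid_msi_vector(vector: u8) -> bool {
    (32..=0xEF).contains(&vector)
}
```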

### T1.5: Kernel-side MSI Affinity Handler ✅ DONE
- File: `kernel/src/scheme/irq.rs`
- Commit: `678980521` in `P8-msi.patch`
- Implements: `Handle::IrqAffinity { irq, mask }` variant, path routing for `<irq>/affinity` and `cpu-XX/<irq>/affinity` (see the parsing sketch below), kwrite validates CPU id and stores mask atomically, kfstat/kfpath/kreadoff/close all handle the new variant
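
A sketch of the path routing only, assuming the two path shapes listed above; the helper name and return type are illustrative:

```rust
/// Accepts "<irq>/affinity" and "cpu-<NN>/<irq>/affinity"; returns
/// (optional CPU id, irq number) or None for anything else.
fn parse_affinity_path(path: &str) -> Option<(Option<u8>, u32)> {
    let mut parts = path.trim_matches('/').split('/');
    let first = parts.next()?;
    let (cpu, irq_part) = if let Some(id) = first.strip_prefix("cpu-") {
        (Some(id.parse::<u8>().ok()?), parts.next()?)
    } else {
        (None, first)
    };
    let irq = irq_part.parse::<u32>().ok()?;
    match parts.next()? {
        "affinity" if parts.next().is_none() => Some((cpu, irq)),
        _ => None,
    }
}
```
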
## 3. Phase 2: DMA/IOMMU Integration (Week 3-5) — AUDITED 2026-05-04

**Status**: IOMMU daemon (1003 lines) and DmaBuffer (261 lines) already exist and are solid. Tasks re-scoped from "create" to "wire."

### T2.1: IommuDmaAllocator (driver-sys) ⏳ P0
- File: `local/recipes/drivers/redox-driver-sys/source/src/dma.rs`
- Add `IommuDmaAllocator` struct: holds the IOMMU domain fd and wraps `DmaBuffer::allocate()` with an IOMMU MAP request (see the sketch below)
- Uses `scheme:iommu/domain/N` write with MAP request → get IOVA
- Linux ref: `include/linux/dma-mapping.h` — `dma_alloc_coherent()` → `iommu_dma_alloc()`
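
A minimal sketch of the MAP/UNMAP round trip, assuming a hypothetical wire encoding (one opcode byte followed by little-endian address and length fields); the real opcodes and framing are the protocol constants inlined in dma.rs, and the path string is only a placeholder:

```rust
use std::fs::{File, OpenOptions};
use std::io::{Read, Result, Write};

// Placeholder opcodes; the real values are the constants inlined in dma.rs.
const IOMMU_OP_MAP: u8 = 1;
const IOMMU_OP_UNMAP: u8 = 2;

/// Owns a handle to one IOMMU domain; every physically contiguous
/// allocation is paired with a MAP so the device sees an IOVA, never a
/// raw physical address.
struct IommuDmaAllocator {
    domain: File,
}

struct DmaMapping {
    phys: u64,  // physical address of the buffer
    iova: u64,  // device-visible address returned by the IOMMU daemon
    size: usize,
}

impl IommuDmaAllocator {
    fn open(domain_path: &str) -> Result<Self> {
        // e.g. a handle on scheme:iommu/domain/N (exact path is a placeholder)
        let domain = OpenOptions::new().read(true).write(true).open(domain_path)?;
        Ok(Self { domain })
    }

    /// MAP: send (phys, len), read back the IOVA chosen by the daemon.
    fn map(&mut self, phys: u64, size: usize) -> Result<DmaMapping> {
        let mut req = Vec::with_capacity(17);
        req.push(IOMMU_OP_MAP);
        req.extend_from_slice(&phys.to_le_bytes());
        req.extend_from_slice(&(size as u64).to_le_bytes());
        self.domain.write_all(&req)?;

        let mut resp = [0u8; 8];
        self.domain.read_exact(&mut resp)?;
        Ok(DmaMapping { phys, iova: u64::from_le_bytes(resp), size })
    }

    /// UNMAP: release the IOVA range; the caller then frees the buffer.
    fn unmap(&mut self, mapping: &DmaMapping) -> Result<()> {
        let mut req = Vec::with_capacity(17);
        req.push(IOMMU_OP_UNMAP);
        req.extend_from_slice(&mapping.iova.to_le_bytes());
        req.extend_from_slice(&(mapping.size as u64).to_le_bytes());
        self.domain.write_all(&req)
    }
}
```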

### T2.2: GPU DMA pass-through ⏳ P0
- Wire `redox-drm` GPU drivers to open IOMMU device endpoint and use IommuDmaAllocator
- amdgpu: VRAM/GTT allocations through IOMMU domain
- Intel i915: GTT pages through IOMMU domain
- Files: `local/recipes/gpu/redox-drm/source/`, `local/recipes/gpu/amdgpu/source/`

### T2.3: Streaming DMA (linux-kpi) ⏳ P1
- `dma_map_single()`: allocate bounce buffer, copy data, map through IOMMU (see the sketch below)
- `dma_unmap_single()`: copy back, unmap, free bounce buffer
- Linux ref: `kernel/dma/mapping.c` — streaming API
- File: `local/recipes/drivers/linux-kpi/source/`
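
A sketch of the copy-in/copy-out discipline only, with stand-in types (`BounceBuffer`, `DmaDirection`); allocating the bounce buffer and issuing the actual MAP/UNMAP requests is the IommuDmaAllocator's job and is omitted here:

```rust
#[derive(Clone, Copy, PartialEq)]
enum DmaDirection {
    ToDevice,
    FromDevice,
}

/// Stand-in for a physically contiguous, IOMMU-mapped staging buffer.
struct BounceBuffer {
    backing: Vec<u8>,
    iova: u64, // device-visible address of the staging buffer
}

struct StreamingMapping<'a> {
    bounce: BounceBuffer,
    caller: &'a mut [u8],
    dir: DmaDirection,
}

/// dma_map_single(): stage CPU-to-device data into the mapped buffer.
fn dma_map_single<'a>(
    mut bounce: BounceBuffer,
    caller: &'a mut [u8],
    dir: DmaDirection,
) -> StreamingMapping<'a> {
    assert!(bounce.backing.len() >= caller.len());
    if dir == DmaDirection::ToDevice {
        bounce.backing[..caller.len()].copy_from_slice(caller);
    }
    StreamingMapping { bounce, caller, dir }
}

/// dma_unmap_single(): copy device writes back out, then hand the IOVA
/// to whoever sends the UNMAP request and frees the staging buffer.
fn dma_unmap_single(mapping: StreamingMapping<'_>) -> u64 {
    if mapping.dir == DmaDirection::FromDevice {
        let n = mapping.caller.len();
        mapping.caller.copy_from_slice(&mapping.bounce.backing[..n]);
    }
    mapping.bounce.iova
}
```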

### T2.4: NVMe DMA pass-through ⏳ P1
- Wire `ahcid`/`nvmed` PRP list physical addresses through IOMMU domain (see the sketch below)
- Linux ref: `drivers/nvme/host/pci.c` — `nvme_map_data()`
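
The driver-side change is small once T2.1 lands: each PRP entry must carry the IOVA returned by the MAP call instead of the raw physical page address. A sketch, where `map_page` stands in for whatever per-page mapping helper the driver ends up using:

```rust
/// Translate a request's physical pages into device-visible PRP entries.
fn build_prp_entries(
    phys_pages: &[u64],
    mut map_page: impl FnMut(u64) -> std::io::Result<u64>,
) -> std::io::Result<Vec<u64>> {
    phys_pages
        .iter()
        .map(|&phys| map_page(phys)) // each PRP entry becomes an IOVA
        .collect()
}
```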

### T2.5: SWIOTLB Fallback (low priority) ⏳ P2
- Linux ref: `kernel/dma/swiotlb.c`
- Bounce buffer for devices with <4GB DMA addressing
- Only needed for legacy hardware; modern x86_64 devices don't need it

## 4. Phase 3: Scheduler Improvements (Week 4-6) — MOSTLY DONE

### T3.1: LAPIC Timer as Primary Tick ✅ DONE
- P7-scheduler-improvements.patch: LAPIC timer calibrated + enabled at vector 48
- TSC-deadline mode, 1000Hz tick drives DWRR scheduler directly
- PIT fallback retained

### T3.2: Per-CPU Scheduler Locks ✅ DONE
- Work-stealing load balancer in switch.rs (victim selection sketched below)
- Per-CPU nr_running counter
- Idle CPUs steal work via IPI
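
A sketch of the victim-selection step only, assuming a per-CPU `nr_running` array; the actual steal and IPI wake-up live in switch.rs and the names here are illustrative:

```rust
use core::sync::atomic::{AtomicUsize, Ordering};

const MAX_CPUS: usize = 8;
const IDLE: AtomicUsize = AtomicUsize::new(0);

/// One runnable-task counter per CPU, updated on enqueue/dequeue.
static NR_RUNNING: [AtomicUsize; MAX_CPUS] = [IDLE; MAX_CPUS];

/// An idle CPU picks the busiest peer; only worth stealing if that peer
/// has more than one runnable task.
fn pick_steal_victim(self_cpu: usize) -> Option<usize> {
    let mut victim = None;
    let mut best = 1;
    for cpu in 0..MAX_CPUS {
        if cpu == self_cpu {
            continue;
        }
        let n = NR_RUNNING[cpu].load(Ordering::Relaxed);
        if n > best {
            victim = Some(cpu);
            best = n;
        }
    }
    victim
}
```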

### T3.3: Load Balancing ✅ DONE
- RT scheduling class (priority 0-9, skip DWRR, immediate dispatch)
- Threshold reduced: 3→1 ticks for LAPIC-driven mode
- Geometric weights in DWRR

### T3.4: RT Scheduling Class ✅ DONE
- Linux ref: `kernel/sched/rt.c`
- FIFO and Round-Robin classes
- Priority inheritance
- RT throttling: 95% CPU cap/sec

### T3.5: TSC-Deadline Timer
- Use IA32_TSC_DEADLINE MSR for precise tick (see the sketch below)
- True tickless operation
- TSC calibration via HPET or PIT
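
A minimal sketch of arming the next tick, assuming the LAPIC LVT timer has already been switched to TSC-deadline mode and `tsc_hz` comes from HPET/PIT calibration; illustrative, not the project's timer code:

```rust
use core::arch::asm;
use core::arch::x86_64::_rdtsc;

const IA32_TSC_DEADLINE: u32 = 0x6E0; // MSR index
const TICK_HZ: u64 = 1000;

/// Program IA32_TSC_DEADLINE with "now + one tick worth of TSC cycles".
/// Unsafe: executes rdtsc/wrmsr and must run at CPL0.
unsafe fn arm_next_tick(tsc_hz: u64) {
    let deadline = _rdtsc() + tsc_hz / TICK_HZ;
    let (lo, hi) = (deadline as u32, (deadline >> 32) as u32);
    // wrmsr: MSR index in ECX, value in EDX:EAX.
    asm!(
        "wrmsr",
        in("ecx") IA32_TSC_DEADLINE,
        in("eax") lo,
        in("edx") hi,
        options(nostack, preserves_flags),
    );
}
```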

### T3.6: NUMA-Aware Scheduling ❌
- Not implemented — low priority for desktop/non-NUMA systems

## 5. Phase 4: Thread Creation (Week 6-7)

### T4.1: Batched Thread Creation
- Batch new-thread requests (reduce IPC)
- Pre-allocate stack pages during fork

### T4.2: Kernel Thread Pool
- Pre-create idle kernel threads
- Reuse via object pool

### T4.3: Shared Memory IPC
- Use shm for proc scheme bulk ops
- Avoid data copy through IPC channel

## 6. Dependencies

Phase 1 (MSI): T1.1 → T1.2 → T1.3 → T1.4 → T1.5
Phase 2 (DMA): T2.1 → T2.2 → T2.3 → T2.4 → T2.5
Phase 3 (Sched): T3.1 → T3.5 → T3.2 → T3.3 → T3.4
Phase 4 (Thread): T4.1 → T4.2 → T4.3

Phases 1 and 2 are independent and can run in parallel. T2.4 depends on T1.3.
T3.1 is already complete, so the remaining Phase 3 work can start immediately.

## 7. Timeline

| Phase | Duration | Cumulative |
|-------|----------|------------|
| Phase 1 (MSI) | 3 weeks | Week 3 |
| Phase 2 (DMA/IOMMU) | 3 weeks | Week 5 |
| Phase 3 (Scheduler) | 3 weeks | Week 7 |
| Phase 4 (Threads) | 2 weeks | Week 7 |

Total: 7 weeks with two developers (Phases 1 and 2 run in parallel)

## 8. Success Metrics

| Metric | Before | After |
|--------|--------|-------|
| Scheduler tick | 148 Hz (PIT) | 1000 Hz (LAPIC) |
| NVMe throughput | Bottlenecked by shared INTx | Scales with MSI-X (4+ queues) |
| Scheduling latency (worst-case tick delay) | ~6.75 ms | ~1 ms |
| Thread create | 3 IPC hops | 2 IPC hops |
| DMA safety | Unprotected | IOMMU-mapped |