git: restore clean submodule tracking and add libredox fork

The working tree had accumulated git-tracking drift across the
local/sources, local/recipes/*/source, and local/reference trees.

Restored:
- local/sources/libredox: add missing 160000 gitlink at d01da350
  (submodule/libredox). The .gitmodules entry already declared this fork;
  the parent tree entry was missing so a fresh clone of the parent would
  not pull the libredox source.
- .gitignore: mark the four local/recipes/*/source build-cache trees
  (uutils-tar, ninja-build, sddm, sddm/source-pristine) and the two
  local/reference/* entries (linux-7.1, seL4) as ignored. These are build
  caches and external references, not durable Red Bear code. The durable
  code for the four recipes is recipe.toml + the corresponding patch
  (redox.patch).
- Note in .gitignore: do not extend local/recipes/**/source to a blanket
  rule, because ~150 Red Bear fork recipes do keep their durable source
  under local/recipes/<name>/source/.

Removed six broken 160000 gitlinks:
- local/recipes/archives/uutils-tar/source (e4c2affa...): on-disk working
  tree was a self-clone of RedBear-OS; gitlink pointed to a non-existent
  commit in the parent object database.
- local/recipes/dev/ninja-build/source (d829f42b...): gitlink was a
  dangling commit on a diverged branch that has since been rewritten; the
  on-disk HEAD is upstream v1.13.1 (79feac0) which the recipe re-fetches
  via recipe.toml anyway. The 6.4MB embedded .git directory was also
  removed.
- local/recipes/kde/sddm/source (63780fcd...): build cache for sddm 0.21.0
  re-fetched via recipe.toml. The 11MB embedded .git directory was also
  removed.
- local/recipes/kde/sddm/source-pristine (63780fcd...): empty placeholder,
  build cache. Removed.
- local/reference/linux-7.1 (ab9de95c...): external Linux reference tree,
  gitignored by size. The on-disk directory is preserved per AGENTS.md
  'NEVER delete the reference tree'.
- local/reference/seL4 (a0b4f2d2...): empty placeholder, gitignored.

Removed untracked pollution at repo root:
- kernel (empty 0-byte file)
- qqmljsgrammar.cpp, qqmljsgrammar_p.h, qqmljsparser.cpp, qqmljsparser_p.h
  (393KB total: build artifacts that escaped a qtdeclarative build into
  the working tree root; they belong inside the recipe source tree, not at
  the parent level)

Added:
- local/docs/MULTITHREADING-COMPREHENSIVE-ASSESSMENT-AND-PLAN.md:
  comprehensive multi-threading audit and implementation plan covering
  kernel scheduler, kernel futex, syscall ABI, relibc pthreads, and
  userspace threading correctness. Will drive the next implementation
  cycle after the git tracking work is wrapped.

After this commit:
- 9 submodule entries in HEAD, all of local/sources/* forks.
- All previously-existing 8 fork submodules unchanged.
- libredox is now durable across clones (was previously lost).
- No untracked files at root.
- No dangling or self-referencing gitlinks.
2026-07-02 06:04:52 +03:00
parent 3671ca573c
commit c120c3519f
9 changed files with 737 additions and 8 deletions
@@ -13,8 +13,12 @@
 # Nested recipe debris from prior build-system layouts (4.2GB+ of duplicates)
 recipes/recipes/

-# Fetched source trees in mainline recipes (not our code in local/)
-# Matches recipes/<category>/<name>/source/ but NOT local/recipes/*/source/
+# Fetched source trees in mainline recipes AND in specific local/ build-cache
+# recipes (those whose source/ is a transient working copy re-fetched by the
+# build system from the recipe's `git` URL). The durable code for these is
+# recipe.toml + local/patches/. DO NOT add a blanket `local/recipes/**/source`
+# rule here: ~150 Red Bear recipes have durable source code under
+# `local/recipes/<name>/source/` (the fork model).
 recipes/**/source
 recipes/**/source.tmp
 recipes/**/source-new
@@ -22,6 +26,10 @@ recipes/**/source-old
 recipes/**/source.tar
 recipes/**/source.tar.tmp
 recipes/**/source.pre-preservation-test/
+local/recipes/archives/uutils-tar/source
+local/recipes/dev/ninja-build/source
+local/recipes/kde/sddm/source
+local/recipes/kde/sddm/source-pristine

 # Build artifacts — target/ dirs are everywhere
 target
@@ -31,6 +39,12 @@ wget-log
 # Vendor source trees (fetched, not our code)
 **/amdgpu-source/

+# External reference trees (read-only consultation sources). The Linux
+# reference tree (local/reference/linux-7.1) is currently kept locally
+# but is gitignored because of its size; the seL4 reference is an empty placeholder.
+local/reference/linux-*/
+local/reference/seL4/
+
 # Compiled objects
 *.o
 *.so
@@ -0,0 +1,720 @@
+# Red Bear OS — Multi-Threading Comprehensive Assessment and Implementation Plan
+
+**Date:** 2026-07-02
+**Scope:** Full-stack multi-threading audit: hardware/SMP, kernel scheduler, kernel futex, kernel syscall ABI, relibc pthreads, userspace threading correctness and performance
+**Status:** Authoritative — supersedes `archived/KERNEL-SCHEDULER-MULTITHREAD-IMPROVEMENT-PLAN.md` and `archived/SCHEDULER-REVIEW-FINAL.md` for all threading matters
+**Validation levels:** `builds` → `enumerates` → `usable` → `validated` → `hardware-validated`
+
+---
+
+## 1. Executive Summary
+
+### The Critical Finding — Lost Threading Work
+
+The P5–P9 scheduler and futex enhancement work (documented as "complete" in the archived
+plans) was **lost during the local fork migration** (2026-06). The local forks at
+`local/sources/kernel/` and `local/sources/relibc/` were created from **upstream Redox
+baselines** that did NOT include the Red Bear enhancement patches. The patches exist in
+`local/patches/kernel/` and `local/patches/relibc/` but are **not wired into the recipes**
+(both `recipe.toml` files use `path = "..."` with no `patches = [...]` list).
+
+**Impact:** The running kernel has:
+- Baseline DWRR scheduler only (no per-CPU queues, no work stealing, no load balancing, no vruntime, no RT scheduling, no cache-affine)
+- Baseline futex only (WAIT/WAIT64/WAKE — no sharding, no PI, no REQUEUE, no robust, no WAKE_OP, no BITSET)
+- relibc `sched_*` are all `todo!()`, `pthread_setschedparam` is a no-op, robust mutexes are `todo_skip!`, PI is absent
+
+**Recovery:** 13 of 18 kernel P5–P9 patches apply cleanly to the current fork. 5 fail due to
+patch-chain dependencies (they expect earlier patches applied first). The bulk of the work is
+recoverable by re-applying patches to the forks and committing them.
+
+### What Actually Works Today
+
+| Layer | Status | Detail |
+|-------|--------|--------|
+| **SMP boot** | ✅ Solid | INIT→SIPI sequence correct, per-CPU PCR via GS_BASE, x2APIC support |
+| **Context switching** | ✅ Solid | FPU/SIMD/AVX state save via XSAVE, FSBASE/GSBASE swap (FSGSBASE or MSR), correct callee-saved register save |
+| **TLB shootdown protocol** | ✅ Correct | AtomicBool flag + IPI + ack counter with `fence(SeqCst)` race prevention |
+| **Basic thread lifecycle** | ✅ Functional | pthread_create/join/detach/exit through proc scheme + redox_rt clone |
+| **Basic synchronization** | ✅ Functional | Futex-backed mutex, condvar, rwlock, barrier, spinlock, once |
+| **TLS** | ✅ Functional | ELF PT_TLS + pthread_key_create/getspecific/setspecific |
+| **Per-CPU data** | ✅ Functional | PercpuBlock via GS_BASE, all per-CPU state accessible |
+| **Signal delivery** | ✅ Functional | Shared-memory Sigcontrol pages, per-thread masks, trampoline |
+| **Scheduler algorithm** | 🚧 Basic DWRR | 40 priority levels, geometric weights, cooperative preemption (3-tick quantum) |
+| **Futex operations** | 🚧 Basic only | WAIT/WAIT64/WAKE with single global mutex |
+| **SMP load balancing** | ❌ Missing | No work stealing, no migration, contexts stuck on birth CPU |
+| **RT scheduling** | ❌ Missing | No SCHED_FIFO/SCHED_RR, no kernel policy dispatch |
+| **Futex REQUEUE** | ❌ Missing | Condvar broadcast causes thundering herd |
+| **Robust mutexes** | ❌ Missing | Thread death while holding mutex → permanent deadlock |
+| **PI futexes** | ❌ Missing | No priority inheritance → priority inversion risk |
+| **CPU affinity API** | ❌ Missing from relibc | Kernel supports sched_affinity field but no userspace API |
+| **Thread naming** | ❌ Missing from relibc | Kernel supports name field but no userspace API |
+| **Per-page TLB flush** | ❌ Missing | `invalidate_all()` = full CR3 reload on every shootdown |
+| **NUMA awareness** | ❌ Missing | No SRAT/SLIT, no proximity domains, flat memory model |
+| **IRQ balancing** | ❌ Missing | All legacy IRQs hardwired to BSP |
+
+---
+
+## 2. Layer-by-Layer Assessment
+
+### 2.1 Hardware / SMP Layer
+
+**Files:** `src/acpi/madt/arch/x86.rs`, `src/arch/x86_shared/start.rs`,
+`src/arch/x86_shared/device/local_apic.rs`, `src/arch/x86_shared/device/ioapic.rs`,
+`src/arch/x86_shared/ipi.rs`, `src/arch/x86_shared/interrupt/ipi.rs`, `src/percpu.rs`,
+`src/arch/x86_shared/gdt.rs`
+
+**Verdict: Functional foundation, performance gaps.**
+
+| Component | Status | Detail |
+|-----------|--------|--------|
+| AP boot (INIT/SIPI) | ✅ validated | Correct trampoline at 0x8000, per-AP PCR/IDT/stack allocation |
+| x2APIC mode | ✅ builds | Detected via CPUID, MSR-based access, APIC ID detection |
+| Per-CPU PCR via GS_BASE | ✅ validated | `PercpuBlock::current()` reads from PCR, SWAPGS protocol correct |
+| IPI send/receive | ✅ functional | 5 IPI kinds (Wakeup/Tlb/Switch/Pit/Profile), broadcast + unicast |
+| TLB shootdown | ✅ correct | AtomicBool + IPI + ack with `fence(SeqCst)` race prevention |
+| TLB granularity | ❌ coarse | Full CR3 reload (`mov cr3, cr3`) on every shootdown — no INVLPG |
+| TLB broadcast | 🚧 sequential | Iterates CPUs individually, doesn't use ICR "all excluding self" shorthand |
+| IRQ routing | ❌ BSP-only | Legacy I/O APIC entries hardcode `dest: bsp_apic_id` |
+| NUMA | ❌ absent | No SRAT/SLIT, no proximity domains |
+| SMT/HT topology | ❌ absent | No cache hierarchy, no hyperthread awareness |
+| Idle loop | ✅ functional | MWAIT with deepest C-state or HLT fallback |
+| W^X for trampoline | 🚧 minor | Trampoline page briefly W+X, unmapped after AP boot |
+
+### 2.2 Kernel Scheduler Layer
+
+**Files:** `src/context/switch.rs`, `src/context/mod.rs`, `src/context/context.rs`,
+`src/context/timeout.rs`
+
+**Verdict: Correct but primitive — DWRR only, no SMP balancing, no RT classes.**
+
+**Algorithm:** Deficit Weighted Round Robin (DWRR)
+- 40 priority levels, each a `VecDeque<WeakContextRef>`
+- Geometric weights: adjacent `SCHED_PRIO_TO_WEIGHT` entries differ by a factor of ≈ 1.25, falling from 88761 to 15 across the 40 levels
+- Per-CPU `balance` accumulator drives dequeue decisions
+- Quantum: 3 PIT ticks (~12.2ms) per scheduling round
+- Cooperative preemption: `preempt_locks > 0` disables preemption
+
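To make the geometric weighting concrete, here is a minimal sketch that rebuilds an approximate 40-entry table from a single ratio. It is not the kernel's `SCHED_PRIO_TO_WEIGHT` table: the neutral level (20), its weight (1024), and the names `approx_weight` and `cpu_share` are assumptions chosen because they reproduce the two endpoints quoted above (88761 and 15) to within about 1%.

```rust
/// Hypothetical reconstruction of a 40-entry geometric weight table.
/// Assumptions (not taken from the kernel source): level 20 is the neutral
/// level with weight 1024, and each step scales the weight by 1.25.
const LEVELS: usize = 40;
const NEUTRAL_LEVEL: i32 = 20;
const NEUTRAL_WEIGHT: f64 = 1024.0;
const STEP: f64 = 1.25;

/// Approximate weight for a priority level (0 = heaviest, 39 = lightest).
fn approx_weight(level: usize) -> f64 {
    assert!(level < LEVELS, "only 40 priority levels exist");
    NEUTRAL_WEIGHT * STEP.powi(NEUTRAL_LEVEL - level as i32)
}

/// Fraction of CPU time `level` would receive if exactly one context were
/// runnable at each level listed in `runnable` (which must include `level`).
fn cpu_share(level: usize, runnable: &[usize]) -> f64 {
    let total: f64 = runnable.iter().map(|&l| approx_weight(l)).sum();
    approx_weight(level) / total
}
```

Under these assumptions, two contexts one level apart split the CPU roughly 56% to 44%, which is what a per-step ratio of 1.25 implies.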
+**Global locks:**
+- `RUN_CONTEXTS: Mutex<L1, RunContextData>` — all 40 priority queues under one L1 lock
+- `IDLE_CONTEXTS: Mutex<L2, VecDeque<WeakContextRef>>` — sleeping contexts
+- `CONTEXT_SWITCH_LOCK: AtomicBool` — global CAS spinlock serializing all context switches
+
+**What's missing (all was in lost P5–P9 work):**
+
+| Gap | Lost Patch | Recoverable? |
+|-----|-----------|-------------|
+| Per-CPU run queues (eliminate global L1) | P6-percpu-runqueues, P8-percpu-sched, P8-percpu-wiring | ✅ applies cleanly |
+| Work stealing | P8-work-stealing | ❌ needs rebase (depends on per-CPU wiring) |
+| Initial placement (least-loaded CPU) | P8-initial-placement | ✅ applies cleanly |
+| Load balancing | P8-load-balance (absorbed) | needs verification |
+| Vruntime tracking + min-vruntime selection | P6-vruntime-switch | ✅ applies cleanly |
+| SchedPolicy enum (FIFO/RR/Other) | P5-sched-rt-policy | ✅ applies cleanly |
+| RT scheduling dispatch | P5-sched-rt-policy | ✅ applies cleanly |
+| Cache-affine scheduling | P7-cache-affine-switch | ✅ applies cleanly |
+| NUMA topology hints | P9-numa-topology | ✅ applies cleanly |
+
+### 2.3 Kernel Futex Layer
+
+**File:** `src/syscall/futex.rs`
+
+**Verdict: Baseline only — critical operations missing for desktop workloads.**
+
+| Operation | Status | Impact of Absence |
+|-----------|--------|-------------------|
+| `FUTEX_WAIT` (32-bit) | ✅ | — |
+| `FUTEX_WAIT64` (64-bit) | ✅ | — |
+| `FUTEX_WAKE` | ✅ | — |
+| `FUTEX_REQUEUE` | ❌ returns EINVAL | `pthread_cond_broadcast` wakes ALL waiters (thundering herd) |
+| `FUTEX_CMP_REQUEUE` | ❌ not defined | Same + atomicity gap |
+| `FUTEX_WAKE_OP` | ❌ not defined | glibc mutex fast path unavailable |
+| `FUTEX_WAIT_BITSET` | ❌ not defined | `pselect`/`ppoll` optimization unavailable |
+| `FUTEX_WAKE_BITSET` | ❌ not defined | Targeted wake unavailable |
+| `FUTEX_LOCK_PI` / `UNLOCK_PI` | ❌ not defined | Priority inversion unprotected |
+| Robust futex list | ❌ not defined | Thread death → permanent deadlock |
+| Futex sharding (per-futex lock) | ❌ single global L1 mutex | All futex ops on all CPUs contend on one lock |
+| Process-private futexes | ❌ global table | Unnecessary cross-process visibility |
+
+**Architecture:**
+```
+static FUTEXES: Mutex<L1, FutexList>  // single global lock
+type FutexList = HashMap<PhysicalAddress, Vec<FutexEntry>>
+```
+
+The key is the physical address, which lets a futex work across address spaces (via MAP_SHARED).
+Each entry also records the virtual address and a `Weak<AddrSpaceWrapper>`, used for CoW disambiguation.
+
+**Recoverable work (lost patches):**
+
+| Feature | Lost Patch | Applies? |
+|---------|-----------|----------|
+| 64-shard hash table | P6-futex-sharding | ✅ cleanly |
+| FUTEX_REQUEUE + CMP_REQUEUE | P8-futex-requeue | ❌ needs rebase |
+| PI futex (LOCK_PI/UNLOCK_PI/TRYLOCK_PI) | P8-futex-pi | ❌ needs rebase |
+| PI CAS fix | P9-futex-pi-cas-fix | ❌ needs rebase |
+| Robust futex list | P8-futex-robust | ❌ needs rebase |
+
+The 4 failing patches likely fail because they depend on sharding (P6-futex-sharding) being
+applied first. Apply in order: P6-sharding → P8-requeue → P8-pi → P8-robust → P9-pi-cas-fix.
+
+### 2.4 Kernel Syscall ABI Layer
+
+**Files:** `src/syscall/mod.rs`, `src/syscall/futex.rs`, `src/syscall/time.rs`,
+`src/syscall/process.rs`, `local/sources/syscall/src/number.rs`, `src/scheme/proc.rs`
+
+**Verdict: Minimal surface — most threading done via proc scheme, not syscalls.**
+
+The kernel defines only ~35 syscall numbers. Threading-relevant ones:
+
+| Syscall | Status | Notes |
+|---------|--------|-------|
+| `SYS_FUTEX` (240) | ✅ partial | WAIT/WAIT64/WAKE only |
+| `SYS_YIELD` (158) | ✅ | `context::switch()` + signal handler |
+| `SYS_FMAP` (900) | ✅ | Anonymous + file-backed mmap |
+| `SYS_FUNMAP` (92) | ✅ | munmap |
+| `SYS_MPROTECT` (125) | ✅ | |
+| `SYS_MREMAP` (155) | ✅ | |
+| `SYS_NANOSLEEP` (162) | ✅ | EINTR-aware |
+| `SYS_CLOCK_GETTIME` (265) | ✅ partial | REALTIME + MONOTONIC only |
+
+**Threading done via proc scheme (not syscalls):**
+
+| Operation | Mechanism |
+|-----------|-----------|
+| Thread/process creation | `proc:` scheme: open "new-context", share addr_space + files via kdup |
+| waitpid | `proc:` scheme: `EVENT_READ` on context fd |
+| getpid/gettid | `proc:` scheme: read "attrs" handle |
+| kill/tkill | `proc:` scheme: `ForceKill` / `Interrupt` ContextVerb |
+| CPU affinity | `proc:` scheme: write "sched-affinity" handle |
+| Priority | `proc:` scheme: write "attrs" prio field |
+| Signal setup | `proc:` scheme: write "sighandler" + shared Sigcontrol pages |
+| TLS base (FSBASE) | `proc:` scheme: write "regs/env" EnvRegisters |
+
+**Completely missing syscalls (no number, no handler):**
+`clone`, `fork`, `vfork`, `waitpid`, `wait4`, `kill`, `tkill`, `tgkill`, `arch_prctl`,
+`set_thread_area`, `set_tid_address`, `set_robust_list`, `get_robust_list`,
+`sched_setaffinity`, `sched_getaffinity`, `sched_setscheduler`, `sched_getparam`,
+`sigaction`, `sigprocmask`, `sigpending`, `sigsuspend`, `sigtimedwait`,
+`timer_create`, `timer_settime`, `timer_delete`, `timerfd_create`,
+`getrusage`, `setrlimit`, `getrlimit`, `times`
+
+### 2.5 relibc Pthread Layer
+
+**Files:** `src/pthread/mod.rs`, `src/sync/*.rs`, `src/header/pthread/*.rs`,
+`src/header/sched/mod.rs`, `src/ld_so/tcb.rs`, `src/platform/redox/mod.rs`
+
+**Verdict: Core pthreads solid, scheduling/robust/PI absent, several POSIX gaps.**
+
+#### Fully Working (futex-backed)
+
+| API Group | Backend | Notes |
+|-----------|---------|-------|
+| `pthread_create/join/detach/exit` | redox_rt clone + Waitval | Stack via mmap, TLS via Tcb::new() |
+| `pthread_cancel/setcancelstate/testcancel` | SIGRT_RLCT_CANCEL (33) | Deferred cancellation only |
+| `pthread_mutex_*` (normal/recursive/errorcheck) | AtomicU32 CAS + futex_wait/wake | 3-state: unlocked/locked/waiters |
+| `pthread_cond_*` | Two-counter futex design | CLOCK_REALTIME only (monotonic = stub) |
+| `pthread_rwlock_*` | AtomicU32 + futex | Reader count + WAITING_WR bit |
+| `pthread_barrier_*` | Mutex + Cond | gen_id wrapping counter |
+| `pthread_spin_*` | AtomicI32 CAS | No futex, pure spinning |
+| `pthread_once` | 3-state futex (UNINIT→INITING→INIT) | |
+| `pthread_key_create/getspecific/setspecific/delete` | BTreeMap global + thread_local values | Destructor iteration per POSIX |
+| `pthread_sigmask` | Delegates to sigprocmask | |
+| `pthread_kill` | redox_rt::rlct_kill | |
+| `pthread_atfork` | Thread-local LinkedList hooks | |
+| ELF TLS (`__thread` / `#[thread_local]`) | PT_TLS + Tcb | Static + dynamic DTV for dlopen |
+| `pthread_attr_*` (getters/setters) | RlctAttr struct | |
+
+#### Stubs / No-ops / Missing
+
+| API | Status | Root Cause |
+|-----|--------|------------|
+| `sched_get_priority_max/min` | `todo!()` | Kernel has no scheduling policy API |
+| `sched_getparam/setparam` | `todo!()` | Same |
+| `sched_setscheduler` | `todo!()` | Same |
+| `sched_rr_get_interval` | `todo!()` | Same |
+| `pthread_setschedparam` | No-op (returns Ok) | Kernel ignores policy |
+| `pthread_setschedprio` | No-op (returns Ok) | Kernel ignores priority change |
+| `pthread_getschedparam` | `todo!()` | |
+| `pthread_getcpuclockid` | ENOENT | No per-thread CPU clock |
+| `pthread_mutex_consistent` | `todo_skip!` | Robust mutex not implemented |
+| `pthread_mutex_getprioceiling` | `todo_skip!` | Priority ceiling not implemented |
+| `pthread_mutex_setprioceiling` | `todo_skip!` | Same |
+| `pthread_mutexattr_setprotocol` (PRIO_INHERIT) | Accepted, no-op | PI futex missing |
+| `pthread_mutexattr_setrobust` (ROBUST) | Accepted, no-op | Robust futex missing |
+| `pthread_cond_init` CLOCK_MONOTONIC | `todo_skip!` | |
+| `pthread_cond_signal` | Calls broadcast (wakes ALL) | Missing FUTEX_REQUEUE optimization |
+| `pthread_setaffinity_np` | Not defined | |
+| `pthread_getaffinity_np` | Not defined | |
+| `pthread_setname_np` | Not defined | |
+| `pthread_getname_np` | Not defined | |
+| `pthread_setcanceltype` | Always returns DEFERRED | ASYNC not tracked |
+| Guard pages | Attribute stored, not mapped | No PROT_NONE page before stack |
+| PTHREAD_KEYS_MAX limit | Not checked | |
+
+---
+
+## 3. Gap Classification
+
+### 3.1 Correctness Gaps (Must Fix — Silent Data Corruption or Deadlock)
+
+| # | Gap | Impact | Fix Location |
+|---|-----|--------|-------------|
+| C1 | **No robust mutexes** | Thread death while holding mutex → permanent deadlock for all waiters | Kernel: robust futex list + relibc: pthread_mutex_consistent |
+| C2 | **No PI futexes** | Priority inversion: low-prio thread blocks high-prio thread indefinitely | Kernel: FUTEX_LOCK_PI/UNLOCK_PI + relibc: mutexattr_setprotocol |
+| C3 | **`pthread_cond_signal` wakes ALL** | Every signal behaves like a broadcast: all waiters wake, re-contend for the mutex, and all but one go back to sleep (thundering herd, wasted CPU) | relibc: use true wake(1); may need FUTEX_REQUEUE |
+| C4 | **`fork()` not thread-safe** | `pthread_atfork` hooks exist but child inherits locked mutexes | relibc: implement atfork child handlers properly |
+
+### 3.2 Performance Gaps (Must Fix for Desktop Responsiveness)
+
+| # | Gap | Impact | Fix Location |
+|---|-----|--------|-------------|
+| P1 | **No SMP load balancing** | Cores sit idle while others are overloaded | Kernel: work stealing + initial placement |
+| P2 | **No futex sharding** | Single global L1 mutex for ALL futex ops on ALL CPUs | Kernel: 64-shard hash table |
+| P3 | **No FUTEX_REQUEUE** | `pthread_cond_broadcast` wakes all → thundering herd | Kernel: REQUEUE + CMP_REQUEUE |
+| P4 | **Full TLB flush on every shootdown** | Per-page mprotect/munmap flushes entire TLB on all cores | Kernel: INVLPG-based selective flush |
+| P5 | **Global context switch lock** | Serialization bottleneck beyond ~8 cores | Kernel: per-CPU context switch (needs per-CPU run queues) |
+| P6 | **All IRQs to BSP** | CPU 0 handles all interrupts, cache thrash, latency | Kernel: IRQ steering in I/O APIC + MSI/MSI-X dest field |
+| P7 | **No RT scheduling** | Audio/compositor threads can't get priority | Kernel: SchedPolicy + RT dispatch + relibc: sched_setscheduler |
+
+### 3.3 POSIX Completeness Gaps (Must Fix for Application Compatibility)
+
+| # | Gap | Impact | Fix Location |
+|---|-----|--------|-------------|
+| X1 | `sched_*` all `todo!()` | Applications calling sched_setscheduler panic | relibc: implement via proc scheme |
+| X2 | `pthread_setschedparam` no-op | Apps can't change thread priority | relibc: wire to proc scheme prio write |
+| X3 | `pthread_setaffinity_np` missing | Apps can't pin threads to CPUs | relibc: implement via proc scheme affinity write |
+| X4 | `pthread_setname_np` missing | Debugging harder (no thread names in /proc) | relibc: implement via proc scheme name write |
+| X5 | `pthread_getcpuclockid` ENOENT | Per-thread profiling impossible | relibc + kernel: expose cpu_time via clock |
+| X6 | Guard pages not mapped | Stack overflow → silent corruption, no SIGSEGV | relibc: mmap PROT_NONE guard page in pthread_create |
+| X7 | `pthread_cond_init` monotonic stub | CLOCK_MONOTONIC condvars use REALTIME (affected by wall clock jumps) | relibc: implement monotonic condvar |
+
+---
+
+## 4. Implementation Plan
+
+### Phase 0: Patch Recovery — Re-Apply Lost Threading Work (Week 1–2)
+
+**Goal:** Recover the P5–P9 work that was lost during the local fork migration.
+
+**This is the highest-priority phase — it restores ~6 months of work with minimal new code.**
+
+#### 0.1 — Re-apply kernel scheduler patches to local fork
+
+Apply in dependency order to `local/sources/kernel/`:
+
+| Order | Patch | Status | Action |
+|-------|-------|--------|--------|
+| 1 | P6-futex-sharding | ✅ applies | Commit directly |
+| 2 | P6-percpu-runqueues | ✅ applies | Commit directly |
+| 3 | P8-percpu-sched | ✅ applies | Commit directly |
+| 4 | P8-percpu-wiring | ✅ applies | Commit directly |
+| 5 | P8-initial-placement | ✅ applies | Commit directly |
+| 6 | P5-sched-rt-policy | ✅ applies | Commit directly |
+| 7 | P5-context-mod-sched | ✅ applies | Commit directly |
+| 8 | P6-vruntime-switch | ✅ applies | Commit directly |
+| 9 | P7-cache-affine-switch | ✅ applies | Commit directly |
+| 10 | P9-numa-topology | ✅ applies | Commit directly |
+| 11 | P9-proc-lock-ordering | ✅ applies | Commit directly |
+| 12 | P8-work-stealing | ❌ needs rebase | Rebase against 1–11, then apply |
+| 13 | P8-futex-requeue | ❌ needs rebase | Rebase against P6-sharding (#1), then apply |
+| 14 | P8-futex-pi | ❌ needs rebase | Rebase against #13, then apply |
+| 15 | P8-futex-robust | ❌ needs rebase | Rebase against #14, then apply |
+| 16 | P9-futex-pi-cas-fix | ❌ needs rebase | Rebase against #14, then apply |
+| 17 | P7-scheduler-improvements | ❌ needs rebase | Rebase against 1–11, then apply |
+
+**Verification after each patch:**
+```bash
+cd local/sources/kernel
+cargo check  # must pass
+```
+
+#### 0.2 — Re-apply relibc threading patches to local fork
+
+Apply to `local/sources/relibc/`:
+
+| Patch | Action |
+|-------|--------|
+| P3-threads.patch | ✅ applies — commit |
+| P3-barrier-smp-futex (from absorbed/) | Verify already in fork; if not, apply |
+| P3-pthread-signal-races (from absorbed/) | Verify already in fork |
+| P3-pthread-yield (from absorbed/) | Verify already in fork |
+| P5-robust-mutexes (from absorbed/) | Verify; re-apply if missing |
+| P5-robust-mutex-enotrec-fix (from absorbed/) | Same |
+| P5-sched-api (from absorbed/) | Same |
+| P7-pthread-affinity (from absorbed/) | Same |
+| P7-pthread-setname (from absorbed/) | Same |
+| P7-setpriority (from absorbed/) | Same |
+| P9-spin-and-barrier (from absorbed/) | Same |
+| P9-spin-fix (from absorbed/) | Same |
+| P3-semaphore-comprehensive | ✅ applies |
+
+**Verification:**
+```bash
+cd local/sources/relibc
+make all  # must pass
+touch relibc && make prefix  # rebuild prefix with new libc
+```
+
+#### 0.3 — Build and smoke test
+
+```bash
+export REDBEAR_ALLOW_PROTECTED_FETCH=1
+./local/scripts/build-redbear.sh --upstream redbear-mini
+make qemu  # verify boot + basic operation
+```
+
+**Success criteria:** redbear-mini boots, multi-threaded daemons (pcid, xhcid) start, no kernel panic.
+
+---
+
+### Phase 1: Futex Completeness (Week 2–4)
+
+**Goal:** Close the futex operation gaps that affect correctness and performance.
+
+**Depends on:** Phase 0 complete (sharding applied first).
+
+#### 1.1 — FUTEX_REQUEUE + FUTEX_CMP_REQUEUE
+
+**Kernel:** `src/syscall/futex.rs`
+- Add `FUTEX_REQUEUE` and `FUTEX_CMP_REQUEUE` to the futex dispatcher
+- Implement: move up to `val` waiters from addr1 → addr2, optionally compare `*addr1 == val2`
+- Requires locking TWO shards (acquire both in deterministic order to avoid deadlock)
+
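The two-shard locking rule above can be sketched in userspace Rust. This is only an illustration of the lock-ordering idea, not the kernel code: `std::sync::Mutex` stands in for the kernel's lock type, waiters are plain `u32` ids, the shard function `shard_of` is hypothetical, and the wake and compare steps of the real operation are left out. Only the shard count (64) comes from the plan.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Mutex;

const SHARDS: usize = 64;
type WaitQueues = HashMap<usize, VecDeque<u32>>; // futex address -> waiter ids
type Table = Vec<Mutex<WaitQueues>>;

fn new_table() -> Table {
    (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect()
}

/// Hypothetical shard function: drop the two alignment bits, then wrap.
fn shard_of(addr: usize) -> usize {
    (addr >> 2) % SHARDS
}

fn enqueue(table: &Table, addr: usize, waiter: u32) {
    let mut shard = table[shard_of(addr)].lock().unwrap();
    shard.entry(addr).or_default().push_back(waiter);
}

fn waiters(table: &Table, addr: usize) -> Vec<u32> {
    let shard = table[shard_of(addr)].lock().unwrap();
    shard.get(&addr).map(|q| q.iter().copied().collect()).unwrap_or_default()
}

/// Remove up to `max` waiters from the queue at `addr`, oldest first.
fn take(queues: &mut WaitQueues, addr: usize, max: usize) -> Vec<u32> {
    let mut out = Vec::new();
    if let Some(q) = queues.get_mut(&addr) {
        while out.len() < max {
            match q.pop_front() {
                Some(w) => out.push(w),
                None => break,
            }
        }
        if q.is_empty() {
            queues.remove(&addr);
        }
    }
    out
}

/// Move up to `max` waiters from `from` to `to`; returns how many moved.
fn requeue(table: &Table, from: usize, to: usize, max: usize) -> usize {
    let (a, b) = (shard_of(from), shard_of(to));
    if a == b {
        // Same shard: a single lock covers both queues.
        let mut shard = table[a].lock().unwrap();
        let moved = take(&mut shard, from, max);
        let n = moved.len();
        shard.entry(to).or_default().extend(moved);
        return n;
    }
    // Different shards: always lock the lower index first, so two concurrent
    // requeues in opposite directions cannot deadlock.
    let (lo, hi) = (a.min(b), a.max(b));
    let mut first = table[lo].lock().unwrap();
    let mut second = table[hi].lock().unwrap();
    let (src, dst) = if a == lo {
        (&mut *first, &mut *second)
    } else {
        (&mut *second, &mut *first)
    };
    let moved = take(src, from, max);
    let n = moved.len();
    dst.entry(to).or_default().extend(moved);
    n
}
```

Ordering by shard index is one simple way to get a deterministic acquisition order; ordering by lock address would serve the same purpose.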
+**relibc:** `src/sync/cond.rs`
+- Change `pthread_cond_broadcast` to use `FUTEX_REQUEUE` (move waiters from condvar futex to mutex futex)
+- Change `pthread_cond_signal` to wake exactly 1 (not all)
+
+**Impact:** Eliminates thundering herd on every `pthread_cond_broadcast`. Major win for Qt event loop, KWin compositor, Mesa worker threads.
+
+#### 1.2 — PI Futexes (FUTEX_LOCK_PI / FUTEX_UNLOCK_PI / FUTEX_TRYLOCK_PI)
+
+**Kernel:** `src/syscall/futex.rs`
+- Add `PiState` tracking per futex: owner context + waiter list with priorities
+- On `LOCK_PI` block: boost owner's priority to waiter's priority
+- On `UNLOCK_PI`: restore original priority, wake highest-priority waiter
+- Requires kernel RT scheduling (Phase 0.1 #6–7: P5-sched-rt-policy)
+
+**relibc:** `src/sync/pthread_mutex.rs`
+- Implement `PTHREAD_PRIO_INHERIT` protocol path using PI futex
+- Make `pthread_mutexattr_setprotocol(PTHREAD_PRIO_INHERIT)`, currently accepted as a no-op, select this path
+
+#### 1.3 — Robust Futex List
+
+**Kernel:** `src/syscall/futex.rs` + `src/context/context.rs`
+- Add `robust_list_head: Option<usize>` to `Context` struct
+- Implement `set_robust_list` / `get_robust_list` via proc scheme or syscall
+- On thread exit (`exit_this_context`): walk robust list, set `FUTEX_OWNER_DIED` bit, wake one waiter with `EOWNERDEAD`
+
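The exit-time step can be illustrated on a single lock word. The sketch below assumes the bit layout of Linux's robust-futex ABI (`FUTEX_WAITERS`, `FUTEX_OWNER_DIED`, `FUTEX_TID_MASK`); whether Red Bear adopts the same layout is not stated in this plan, and `handle_dead_owner` and `Cleanup` are hypothetical names. Walking the list itself and the actual wake are omitted.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Bit layout borrowed from Linux's robust-futex ABI; using it here is an
// assumption of this sketch, not something the plan specifies.
const FUTEX_WAITERS: u32 = 0x8000_0000;
const FUTEX_OWNER_DIED: u32 = 0x4000_0000;
const FUTEX_TID_MASK: u32 = 0x3fff_ffff;

/// Outcome of visiting one lock word while walking a dead thread's robust list.
#[derive(Debug, PartialEq)]
enum Cleanup {
    /// The word is not (or no longer) owned by the dead thread; leave it alone.
    NotOwned,
    /// The word was marked; `wake_one` says whether a waiter must be woken.
    Marked { wake_one: bool },
}

fn handle_dead_owner(word: &AtomicU32, dead_tid: u32) -> Cleanup {
    let mut cur = word.load(Ordering::Acquire);
    loop {
        if (cur & FUTEX_TID_MASK) != dead_tid {
            return Cleanup::NotOwned;
        }
        // Drop the owner TID, keep the WAITERS bit, and flag the death.
        let new = (cur & FUTEX_WAITERS) | FUTEX_OWNER_DIED;
        match word.compare_exchange_weak(cur, new, Ordering::AcqRel, Ordering::Acquire) {
            Ok(_) => {
                return Cleanup::Marked { wake_one: (cur & FUTEX_WAITERS) != 0 };
            }
            // Userspace changed the word concurrently (or the CAS failed
            // spuriously): re-check ownership against the fresh value.
            Err(seen) => cur = seen,
        }
    }
}
```

The waiter that is woken would then observe the `FUTEX_OWNER_DIED` bit and return `EOWNERDEAD` from its lock call.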
+**relibc:** `src/sync/pthread_mutex.rs`
+- Implement robust list registration in `pthread_mutex_lock`
+- Implement `pthread_mutex_consistent`: clear `EOWNERDEAD` state
+- Replace `todo_skip!` with real implementation
+
+#### 1.4 — FUTEX_WAKE_OP
+
+**Kernel:** `src/syscall/futex.rs`
+- Implement atomic op + wake: perform op on addr2, then wake up to `val` waiters on addr1
+- Operations: set, add, or, andn, xor, with comparison condition
+
+**Impact:** glibc mutex fast path optimization. Not critical for relibc but helps ported glibc-linked binaries.
+
+---
+
+### Phase 2: SMP Scheduling Quality (Week 3–6)
+
+**Goal:** Make multi-core actually distribute work.
+
+**Depends on:** Phase 0 complete (per-CPU queues applied).
+
+#### 2.1 — Work stealing (recover + fix)
+
+**Kernel:** `src/context/switch.rs`
+- On `select_next_context()` empty local queue: steal from victim CPU
+- Pick victim by round-robin, steal highest-priority runnable context
+- Limit steal batch size (1–2 contexts per steal attempt)
+- Send `IpiKind::Wakeup` to target CPU if stealing woke it from idle
+
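A rough model of the stealing policy described above, with a batch size of one. The types `Ctx` and `Cpus` and the method `steal_for` are hypothetical and far simpler than the kernel's per-CPU structures; the sketch also assumes that a smaller `prio` value means a more urgent context, and it ignores locking, affinity masks, and the wakeup IPI.

```rust
use std::collections::VecDeque;

/// A runnable context as seen by this sketch. Assumption: a smaller `prio`
/// value means a more urgent context.
#[derive(Debug, Clone, PartialEq)]
struct Ctx {
    id: u32,
    prio: u8,
}

struct Cpus {
    queues: Vec<VecDeque<Ctx>>, // one run queue per CPU
    next_victim: usize,         // round-robin cursor shared by all thieves
}

impl Cpus {
    /// Called when `thief` found its own queue empty. Visits each other CPU
    /// at most once, in round-robin order, and takes the single most urgent
    /// context from the first non-empty queue.
    fn steal_for(&mut self, thief: usize) -> Option<Ctx> {
        let n = self.queues.len();
        for _ in 0..n {
            let victim = self.next_victim;
            self.next_victim = (self.next_victim + 1) % n;
            if victim == thief {
                continue;
            }
            // `min_by_key` returns the first minimum, so ties keep queue order.
            let best = self.queues[victim]
                .iter()
                .enumerate()
                .min_by_key(|(_, c)| c.prio)
                .map(|(i, _)| i);
            if let Some(i) = best {
                return self.queues[victim].remove(i);
            }
        }
        None
    }
}
```

Advancing the cursor even on a failed probe spreads successive steal attempts across victims instead of hammering the same queue.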
+**Recovery:** P8-work-stealing needs rebase against per-CPU wiring.
+
+#### 2.2 — Load balancing (recover + verify)
+
+**Kernel:** `src/context/switch.rs`
+- Periodic balance trigger (every N ticks or when queue depth difference > threshold)
+- Migrate contexts from overloaded CPU to most-idle CPU
+- Respect `sched_affinity` mask during migration
+
+**Recovery:** P8-load-balance is in absorbed/ — verify it's in the fork after Phase 0.
+
+#### 2.3 — Reschedule IPI
+
+**Kernel:** `src/arch/x86_shared/ipi.rs` + `src/context/switch.rs`
+- When waking a context on a different CPU, send `IpiKind::Switch` to that CPU
+- Currently the Switch IPI exists but is not used by the scheduler
+
+#### 2.4 — Per-page TLB flush (INVLPG)
+
+**Kernel:** `rmm/src/arch/x86_64.rs` + `src/context/memory.rs`
+- Add `invalidate_page(addr)` using `invlpg` instruction
+- Modify `Flusher` to track individual pages and use INVLPG when ≤ N pages affected
+- Fall back to CR3 reload only for large-scale invalidations
+
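The page-tracking idea can be sketched independently of the real `Flusher` type. Everything below is illustrative: the threshold `MAX_SINGLE_PAGES` is a made-up value for the unspecified N, and `FlushPlan` is a hypothetical name. The struct only decides which kind of flush is needed; issuing `invlpg` or reloading CR3 is left to the caller.

```rust
const PAGE_SIZE: usize = 4096;
/// Hypothetical cut-over point; the plan leaves N unspecified.
const MAX_SINGLE_PAGES: usize = 32;

#[derive(Debug, PartialEq)]
enum FlushPlan {
    /// Nothing was unmapped or re-protected.
    Nothing,
    /// Few enough pages that one `invlpg` per page is worthwhile.
    Pages(Vec<usize>),
    /// Too many pages: fall back to a full CR3 reload.
    Full,
}

#[derive(Default)]
struct Flusher {
    pages: Vec<usize>,
    overflowed: bool,
}

impl Flusher {
    /// Record that the mapping containing `addr` changed.
    fn record(&mut self, addr: usize) {
        if self.overflowed {
            return;
        }
        let page = addr & !(PAGE_SIZE - 1);
        if self.pages.contains(&page) {
            return;
        }
        if self.pages.len() == MAX_SINGLE_PAGES {
            // One page too many: stop tracking and request a full flush.
            self.pages.clear();
            self.overflowed = true;
        } else {
            self.pages.push(page);
        }
    }

    /// Consume the tracker and report which flush the caller should perform.
    fn finish(self) -> FlushPlan {
        if self.overflowed {
            FlushPlan::Full
        } else if self.pages.is_empty() {
            FlushPlan::Nothing
        } else {
            FlushPlan::Pages(self.pages)
        }
    }
}
```

In a shootdown, the same plan would have to be shipped to the remote CPUs along with the IPI so that they can apply the per-page invalidations themselves.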
+**Impact:** Every `mprotect`/`mmap`/`munmap` on a multi-threaded process currently flushes the ENTIRE TLB on every core. This is one of the most impactful single fixes.
+
+#### 2.5 — TLB broadcast optimization
+
+**Kernel:** `src/percpu.rs`
+- Replace per-CPU sequential `shootdown_tlb_ipi(Some(id))` loop with ICR "all excluding self" (destination shorthand 0b11)
+- Single IPI + global ack counter instead of N individual IPIs + N ack counters
+
+---
+
+### Phase 3: RT Scheduling (Week 4–6)
+
+**Goal:** Allow applications to request real-time scheduling for latency-sensitive threads.
+
+**Depends on:** Phase 0 (SchedPolicy applied) + Phase 2 (per-CPU queues).
+
+#### 3.1 — Kernel RT scheduling dispatch
+
+**Kernel:** `src/context/switch.rs` (from P5-sched-rt-policy — recovered in Phase 0)
+- `select_next_context()` passes:
+  1. SCHED_FIFO contexts (highest RT priority first, no preemption within same prio)
+  2. SCHED_RR contexts (highest RT priority first, round-robin within same prio)
+  3. SCHED_OTHER contexts (existing DWRR/vruntime)
+- SCHED_RR quantum: configurable per-context (default 100ms)
+
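The three-pass selection order can be modelled on a flat ready list. This is a sketch of the ordering only, not the recovered `select_next_context()`: `Policy`, `Ready`, and `select_next` are hypothetical names, the list stands in for the per-CPU queues, and the time-slice handling that distinguishes SCHED_RR from SCHED_FIFO is not modelled.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Policy {
    Fifo,
    Rr,
    Other,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct Ready {
    id: u32,
    policy: Policy,
    rt_prio: u8, // only meaningful for Fifo and Rr
}

/// Index of the context to run next, scanning `queue` in arrival order.
fn select_next(queue: &[Ready]) -> Option<usize> {
    // Passes 1 and 2: real-time classes, highest RT priority first.
    for pass in &[Policy::Fifo, Policy::Rr] {
        let mut best: Option<usize> = None;
        for (i, c) in queue.iter().enumerate() {
            if c.policy != *pass {
                continue;
            }
            // Strictly greater: on equal priority the earlier arrival wins.
            if best.map_or(true, |b| c.rt_prio > queue[b].rt_prio) {
                best = Some(i);
            }
        }
        if best.is_some() {
            return best;
        }
    }
    // Pass 3: fall through to the normal class (DWRR/vruntime in the kernel;
    // simply the first such context here).
    queue.iter().position(|c| c.policy == Policy::Other)
}
```

With this ordering any runnable SCHED_FIFO context is chosen before any SCHED_RR context, regardless of their RT priorities, because the passes are strictly sequential.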
+#### 3.2 — relibc sched_* API completion
+
+**relibc:** `src/header/sched/mod.rs`
+
+Replace ALL `todo!()` stubs:
+
+| Function | Implementation |
+|----------|---------------|
+| `sched_getscheduler(pid)` | Read policy from proc scheme attrs |
+| `sched_setscheduler(pid, policy, param)` | Write policy + RT priority via proc scheme |
+| `sched_getparam(pid, param)` | Read RT priority from proc scheme |
+| `sched_setparam(pid, param)` | Write RT priority via proc scheme |
+| `sched_get_priority_max(policy)` | Return 99 for FIFO/RR, 0 for OTHER |
+| `sched_get_priority_min(policy)` | Return 1 for FIFO/RR, 0 for OTHER |
+| `sched_rr_get_interval(pid, tp)` | Return SCHED_RR quantum (100ms default) |
+
+#### 3.3 — pthread_setschedparam wiring
+
+**relibc:** `src/pthread/mod.rs`
+- Replace `set_sched_param` no-op with real proc scheme call
+- Replace `set_sched_priority` no-op with real proc scheme call
+
+---
+
+### Phase 4: POSIX Pthread Completeness (Week 5–8)
+
+**Goal:** Close remaining POSIX gaps that block application compatibility.
+
+**Depends on:** Phase 0 + Phase 3 (for sched API).
+
+#### 4.1 — pthread_setaffinity_np / pthread_getaffinity_np
+
+**relibc:** `src/header/pthread/mod.rs` + `src/header/sched/mod.rs`
+- Implement using proc scheme "sched-affinity" write/read
+- Define `cpu_set_t` type and `CPU_SET/CPU_CLR/CPU_ZERO/CPU_ISSET` macros
+
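A minimal bitmask along the lines of `cpu_set_t` might look as follows. The capacity of 1024 CPUs mirrors glibc's `cpu_set_t`; whether relibc should use the same size, and how the mask is serialized for the "sched-affinity" handle, are open questions this sketch does not answer. The method names map onto the C macros but are otherwise hypothetical.

```rust
/// Capacity of the set. 1024 matches glibc's `cpu_set_t`; adopting it for
/// relibc is an assumption of this sketch.
const CPU_SETSIZE: usize = 1024;
const WORD_BITS: usize = 64;

#[derive(Clone, Debug, PartialEq)]
struct CpuSet {
    bits: [u64; CPU_SETSIZE / WORD_BITS],
}

impl CpuSet {
    /// CPU_ZERO: start with no CPUs selected.
    fn zero() -> Self {
        CpuSet { bits: [0; CPU_SETSIZE / WORD_BITS] }
    }

    /// CPU_SET: out-of-range CPU numbers are ignored.
    fn set(&mut self, cpu: usize) {
        if cpu < CPU_SETSIZE {
            self.bits[cpu / WORD_BITS] |= 1u64 << (cpu % WORD_BITS);
        }
    }

    /// CPU_CLR: out-of-range CPU numbers are ignored.
    fn clr(&mut self, cpu: usize) {
        if cpu < CPU_SETSIZE {
            self.bits[cpu / WORD_BITS] &= !(1u64 << (cpu % WORD_BITS));
        }
    }

    /// CPU_ISSET: out-of-range CPU numbers are reported as not set.
    fn is_set(&self, cpu: usize) -> bool {
        cpu < CPU_SETSIZE && ((self.bits[cpu / WORD_BITS] >> (cpu % WORD_BITS)) & 1) == 1
    }

    /// CPU_COUNT: number of CPUs currently in the set.
    fn count(&self) -> u32 {
        self.bits.iter().map(|w| w.count_ones()).sum()
    }
}
```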
+#### 4.2 — pthread_setname_np / pthread_getname_np
+
+**relibc:** `src/header/pthread/mod.rs`
+- Implement using proc scheme name write/read (kernel already supports 32-char name field)
+
+#### 4.3 — pthread_cond_init CLOCK_MONOTONIC
+
+**relibc:** `src/sync/cond.rs`
+- Replace `todo_skip!` with real monotonic clock support
+- Store clock choice in cond struct, use `CLOCK_MONOTONIC` for deadline calculations
+
+#### 4.4 — Guard pages
+
+**relibc:** `src/pthread/mod.rs`
+- In `pthread_create`, when allocating stack via mmap:
+  - Map `[stack_base, stack_base + guard_size)` with `PROT_NONE`
+  - Map `[stack_base + guard_size, stack_base + guard_size + stack_size)` with `PROT_READ | PROT_WRITE`
+- On thread exit, munmap both regions
+
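The address arithmetic behind the two regions is small enough to spell out. The sketch below only computes the ranges; the `mmap`/`mprotect` calls are omitted, and `StackLayout` and `layout` are hypothetical names. It assumes a downward-growing stack, so the guard sits at the low end and the initial stack pointer at the high end.

```rust
use std::ops::Range;

#[derive(Debug, PartialEq)]
struct StackLayout {
    /// To be mapped PROT_NONE: the first page touched by an overflow.
    guard: Range<usize>,
    /// To be mapped PROT_READ | PROT_WRITE: the usable stack.
    stack: Range<usize>,
    /// Initial stack pointer (the stack grows down towards the guard).
    initial_sp: usize,
}

/// Split one allocation starting at `stack_base` into guard and stack.
/// Returns `None` if the sizes overflow the address space.
fn layout(stack_base: usize, guard_size: usize, stack_size: usize) -> Option<StackLayout> {
    let guard_end = stack_base.checked_add(guard_size)?;
    let stack_end = guard_end.checked_add(stack_size)?;
    Some(StackLayout {
        guard: stack_base..guard_end,
        stack: guard_end..stack_end,
        initial_sp: stack_end,
    })
}
```

Because the guard occupies the addresses just below the usable stack, a thread that overruns its stack faults on the `PROT_NONE` region instead of writing into a neighbouring mapping.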
+#### 4.5 — pthread_getcpuclockid
+
+**relibc:** `src/header/pthread/mod.rs`
+- Return `CLOCK_THREAD_CPUTIME_ID` (requires kernel support — add clock to `clock_gettime`)
+
+**Kernel:** `src/syscall/time.rs`
+- Add `CLOCK_THREAD_CPUTIME_ID` → read `context.cpu_time`
+
+#### 4.6 — PTHREAD_KEYS_MAX enforcement
+
+**relibc:** `src/header/pthread/tls.rs`
+- Check `NEXTKEY` against `PTHREAD_KEYS_MAX` (1024) before allocating, and return `EAGAIN` once the limit is reached
+
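+A bounded key counter might be sketched as follows. The `KeyAllocator` type is hypothetical
+and is not how relibc stores `NEXTKEY`; the point is only that the bounds check and the
+increment happen in one atomic step, so concurrent callers cannot push the counter past the
+limit. `None` stands in for the `EAGAIN` return.
+
+```rust
+use std::sync::atomic::{AtomicUsize, Ordering};
+
+pub const PTHREAD_KEYS_MAX: usize = 1024;
+
+/// Hands out key indices `0..PTHREAD_KEYS_MAX` (hypothetical helper type).
+pub struct KeyAllocator {
+    next: AtomicUsize,
+}
+
+impl KeyAllocator {
+    pub const fn new() -> Self {
+        KeyAllocator { next: AtomicUsize::new(0) }
+    }
+
+    /// Returns the next free key, or `None` once the limit is reached.
+    pub fn alloc(&self) -> Option<usize> {
+        self.next
+            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |n| {
+                // Refuse to advance past the limit; the counter stays unchanged.
+                if n < PTHREAD_KEYS_MAX { Some(n + 1) } else { None }
+            })
+            .ok() // the previous counter value is the key just allocated
+    }
+}
+```
+
+This sketch never reuses keys released by `pthread_key_delete`; a real implementation may
+want a free list on top of the counter.
+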
+---
+
+### Phase 5: IRQ Steering and NUMA (Week 8–12)
+
+**Goal:** Distribute interrupt load and respect memory locality.
+
+**Depends on:** Phase 2 (per-CPU infrastructure).
+
+#### 5.1 — IRQ steering
+
+**Kernel:** `src/arch/x86_shared/device/ioapic.rs` + `src/arch/x86_shared/idt.rs`
+- Change I/O APIC redirection `dest` from `bsp_apic_id` to round-robin or RSS hash
+- Add per-CPU legacy IRQ handlers in IDT (not just BSP)
+- For MSI/MSI-X: set destination CPU in Message Address register
+
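+The round-robin option for choosing a redirection destination can be illustrated with a
+small picker over the online CPUs' APIC IDs. Everything here is a hypothetical sketch:
+`RoundRobinSteering` is not a kernel type, it uses `std` (`Vec`) for brevity, and writing the
+chosen ID into the I/O APIC redirection entry is left out.
+
+```rust
+use std::sync::atomic::{AtomicUsize, Ordering};
+
+/// Cycles through a fixed list of destination APIC IDs (hypothetical type).
+pub struct RoundRobinSteering {
+    apic_ids: Vec<u8>,
+    next: AtomicUsize,
+}
+
+impl RoundRobinSteering {
+    pub fn new(apic_ids: Vec<u8>) -> Self {
+        RoundRobinSteering { apic_ids, next: AtomicUsize::new(0) }
+    }
+
+    /// Returns the APIC ID the next IRQ should target, or `None` if no CPU is listed.
+    pub fn next_dest(&self) -> Option<u8> {
+        if self.apic_ids.is_empty() {
+            return None;
+        }
+        // fetch_add wraps on overflow, so the index stays valid after the modulo.
+        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.apic_ids.len();
+        Some(self.apic_ids[i])
+    }
+}
+```
+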
+#### 5.2 — NUMA topology discovery
+
+**Kernel:** `src/acpi/` (from P9-numa-topology — recovered in Phase 0)
+- Parse SRAT (Static Resource Affinity Table) for proximity domains
+- Parse SLIT (System Locality Distance Information Table) for inter-node distances
+- Store `NumaTopology` in kernel for O(1) scheduling lookups
+
+#### 5.3 — NUMA-aware memory allocation
+
+**Kernel:** `src/memory/` + frame allocator
+- Track frame NUMA node in `Frame` or `PageInfo`
+- On allocation, prefer frames from requesting CPU's NUMA node
+- Fall back to a remote node when the local node is exhausted
+
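+The local-first policy with remote fallback can be sketched with per-node free lists. This
+is an illustration under stated assumptions, not the kernel's frame allocator: `NumaFrames`
+is a hypothetical type, frames are plain numbers, and the fallback order (nearest node
+first, using SLIT-style distances) is one possible choice that the plan does not specify.
+
+```rust
+/// Per-node free frame lists plus a distance matrix (hypothetical type).
+pub struct NumaFrames {
+    /// `free[node]` holds the free frame numbers that belong to `node`.
+    free: Vec<Vec<usize>>,
+    /// `distance[a][b]` is the SLIT-style distance from node `a` to node `b`.
+    distance: Vec<Vec<u8>>,
+}
+
+impl NumaFrames {
+    pub fn new(free: Vec<Vec<usize>>, distance: Vec<Vec<u8>>) -> Self {
+        NumaFrames { free, distance }
+    }
+
+    /// Allocates one frame for a CPU on node `local`.
+    /// Returns `(node, frame)`, trying nodes in order of increasing distance.
+    pub fn alloc(&mut self, local: usize) -> Option<(usize, usize)> {
+        let row = self.distance.get(local)?;
+        let mut order: Vec<usize> = (0..self.free.len().min(row.len())).collect();
+        // Stable sort: nodes at equal distance keep their index order.
+        order.sort_by_key(|&n| row[n]);
+        for node in order {
+            if let Some(frame) = self.free[node].pop() {
+                return Some((node, frame));
+            }
+        }
+        None
+    }
+}
+```
+
+Because a node's distance to itself is the smallest entry in its row, the local node is
+always tried first and remote nodes are used only once it is exhausted.
+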
+---
+
+## 5. Dependency Chain
+
+```
+Phase 0 (Patch Recovery) ← BLOCKING FOR ALL OTHERS
+    │
+    ├──► Phase 1 (Futex Completeness)
+    │       │
+    │       ├──► 1.1 REQUEUE ──► condvar performance
+    │       ├──► 1.2 PI ──► priority inversion fix (needs Phase 3.1)
+    │       ├──► 1.3 Robust ──► deadlock prevention
+    │       └──► 1.4 WAKE_OP ──► glibc compat
+    │
+    ├──► Phase 2 (SMP Scheduling)
+    │       │
+    │       ├──► 2.1 Work stealing ──► core utilization
+    │       ├──► 2.2 Load balancing ──► fair distribution
+    │       ├──► 2.3 Reschedule IPI ──► cross-CPU wakeup
+    │       ├──► 2.4 Per-page TLB ──► mmap/mprotect performance
+    │       └──► 2.5 TLB broadcast ──► IPI efficiency
+    │
+    ├──► Phase 3 (RT Scheduling)
+    │       │
+    │       ├──► 3.1 Kernel RT dispatch (from Phase 0)
+    │       ├──► 3.2 relibc sched_* API ──► POSIX compat
+    │       └──► 3.3 pthread_setschedparam ──► app priority control
+    │
+    ├──► Phase 4 (POSIX Pthread Completeness)
+    │       │
+    │       ├──► 4.1 Affinity API ──► CPU pinning
+    │       ├──► 4.2 Thread naming ──► debuggability
+    │       ├──► 4.3 Monotonic condvar ──► clock correctness
+    │       ├──► 4.4 Guard pages ──► stack overflow detection
+    │       ├──► 4.5 CPU clock ──► per-thread profiling
+    │       └──► 4.6 Keys max ──► resource limit
+    │
+    └──► Phase 5 (IRQ + NUMA)
+            │
+            ├──► 5.1 IRQ steering ──► interrupt distribution
+            ├──► 5.2 NUMA topology ──► (from Phase 0)
+            └──► 5.3 NUMA allocator ──► memory locality
+```
+
+**Parallel work possible:**
+- Phase 1 + Phase 2 + Phase 3 can run in parallel after Phase 0, except item 1.2 (PI futex), which needs Phase 3.1
+- Phase 4 items are independent of each other
+- Phase 5 depends on Phase 2 but not on Phase 1/3/4
+
+---
+
+## 6. Validation Plan
+
+### 6.1 Build Evidence
+
+| Check | Command |
+|-------|---------|
+| Kernel compiles | `make r.kernel` |
+| relibc compiles | `make r.relibc` |
+| Prefix rebuilt | `touch relibc kernel && make prefix` |
+| Full OS builds | `make all CONFIG_NAME=redbear-mini` |
+
+### 6.2 Runtime Evidence (QEMU)
+
+| Test | Verification |
+|------|-------------|
+| Multi-threaded boot | `make qemu QEMUFLAGS="-smp 4"` — all 4 CPUs active |
+| pthread smoke test | Guest: compile + run simple pthread_create/join/mutex test |
+| Work stealing | Guest: spawn 8 threads on 4-CPU QEMU, verify all CPUs utilized |
+| Futex REQUEUE | Guest: condvar broadcast benchmark — waiters wake in ≤2 batches, not N |
+| PI futex | Guest: priority inversion test — high-prio thread unblocked within 1 tick |
+| Robust mutex | Guest: kill thread holding mutex, verify EOWNERDEAD recovery |
+| RT scheduling | Guest: SCHED_FIFO thread preempts SCHED_OTHER within 100μs |
+| CPU affinity | Guest: pin thread to CPU 1, verify it never runs on CPU 0 |
+| Thread naming | Guest: `cat /scheme/proc/*/name` shows set names |
+| Guard pages | Guest: overflow stack, verify SIGSEGV (not silent corruption) |
+| TLB efficiency | Guest: mprotect benchmark — compare TLB miss rate before/after |
+
+### 6.3 Validation Scripts (to create)
+
+```bash
+local/scripts/test-threading-qemu.sh          # Comprehensive threading smoke test
+local/scripts/test-futex-requeue-qemu.sh      # REQUEUE-specific test
+local/scripts/test-futex-pi-qemu.sh           # PI futex test
+local/scripts/test-futex-robust-qemu.sh       # Robust mutex test
+local/scripts/test-sched-rt-qemu.sh           # RT scheduling latency test
+local/scripts/test-sched-balance-qemu.sh      # Load balancing on multi-vCPU
+local/scripts/test-threading-baremetal.sh     # Bare metal multi-threaded stress
+```
+
+---
+
+## 7. Estimated Effort
+
+| Phase | Duration | New Code | Recovery | Dependencies |
+|-------|----------|----------|----------|-------------|
+| Phase 0: Patch Recovery | 1–2 weeks | Minimal (rebase 5 patches) | 13 patches apply directly | None |
+| Phase 1: Futex Completeness | 2–3 weeks | REQUEUE impl + WAKE_OP | PI/robust from P8 patches | Phase 0 |
+| Phase 2: SMP Scheduling | 3–4 weeks | TLB INVLPG + broadcast opt | Work stealing from P8 | Phase 0 |
+| Phase 3: RT Scheduling | 1–2 weeks | relibc sched_* API | RT dispatch from P5 | Phase 0 |
+| Phase 4: POSIX Pthread | 2–3 weeks | Affinity/naming/guard/clock | Partial from P7 patches | Phase 0, 3 |
+| Phase 5: IRQ + NUMA | 3–4 weeks | IRQ steering + NUMA allocator | NUMA topology from P9 | Phase 0, 2 |
+
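+**Total:** 12–18 weeks with 1–2 developers if the phases run back to back (the sum of the per-phase estimates); running Phases 1–3 in parallel could shorten this. Phase 0 alone recovers the majority of the value in 1–2 weeks.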
+
+---
+
+## 8. Integration with Existing Plans
+
+| Plan | Relationship |
+|------|-------------|
+| `CONSOLE-TO-KDE-DESKTOP-PLAN.md` | **Consumer** — Phase 3 (KWin) needs PI futex + RT scheduling; Phase 2 (compositor) needs work stealing |
+| `IRQ-AND-LOWLEVEL-CONTROLLERS-ENHANCEMENT-PLAN.md` | **Sibling** — IRQ steering (Phase 5.1) belongs to both plans |
+| `DRM-MODERNIZATION-EXECUTION-PLAN.md` | **Consumer** — GPU worker threads benefit from load balancing + affinity |
+| `IMPLEMENTATION-MASTER-PLAN.md` | **Parent** — this plan covers the kernel threading substrate |
+| `CPU-DMA-IRQ-MSI-SCHEDULER-FIX-PLAN.md` | **Sibling** — overlaps on scheduler/IRQ delivery |
+
+---
+
+## 9. Bottom Line
+
+The Red Bear OS threading stack is **functional for basic single-threaded and lightly-threaded
+workloads**. The SMP boot, context switching, TLB shootdown, and basic futex operations are
+correct.
+
+The **critical problem** is that 6 months of threading enhancement work (P5–P9 patches) was
+lost during the local fork migration. This work survives as patch files, most of which apply
+directly to the current fork (Section 7 estimates that 13 apply as-is and 5 need rebasing),
+so **Phase 0 (Patch Recovery) is the single highest-ROI action**.
+
+After Phase 0, the remaining gaps are:
+1. **Futex REQUEUE/PI/robust** — for condvar performance and deadlock prevention
+2. **SMP work stealing + load balancing** — for multi-core utilization
+3. **RT scheduling** — for audio/compositor thread priority
+4. **POSIX pthread completeness** — for application compatibility
+5. **IRQ steering + NUMA** — for multi-socket performance
+
+The **desktop-critical path** (KWin responsiveness) requires Phases 0–3. The
+**server-critical path** (multi-socket, NUMA) adds Phase 5. Phase 4 (POSIX completeness)
+benefits all paths but is not desktop-blocking.