diff --git a/.gitignore b/.gitignore
index b999dcbe3b..e2a4c2f4a0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -13,8 +13,12 @@
 # Nested recipe debris from prior build-system layouts (4.2GB+ of duplicates)
 recipes/recipes/
 
-# Fetched source trees in mainline recipes (not our code in local/)
-# Matches recipes/<category>/<name>/source/ but NOT local/recipes/*/source/
+# Fetched source trees in mainline recipes AND in specific local/ build-cache
+# recipes (those whose source/ is a transient working copy re-fetched by the
+# build system from the recipe's `git` URL). The durable code for these is
+# recipe.toml + local/patches/. DO NOT add a blanket `local/recipes/**/source`
+# rule here: ~150 Red Bear recipes have durable source code under
+# `local/recipes/<name>/source/` (the fork model).
 recipes/**/source
 recipes/**/source.tmp
 recipes/**/source-new
@@ -22,6 +26,10 @@ recipes/**/source-old
 recipes/**/source.tar
 recipes/**/source.tar.tmp
 recipes/**/source.pre-preservation-test/
+local/recipes/archives/uutils-tar/source
+local/recipes/dev/ninja-build/source
+local/recipes/kde/sddm/source
+local/recipes/kde/sddm/source-pristine
 
 # Build artifacts — target/ dirs are everywhere
 target
@@ -31,6 +39,12 @@ wget-log
 # Vendor source trees (fetched, not our code)
 **/amdgpu-source/
 
+# External reference trees (read-only consultation sources). The Linux
+# reference tree (local/reference/linux-7.1) is currently kept locally
+# but is gitignored because of its size; the seL4 reference is an empty placeholder.
+local/reference/linux-*/
+local/reference/seL4/
+
 # Compiled objects
 *.o
 *.so
diff --git a/local/docs/MULTITHREADING-COMPREHENSIVE-ASSESSMENT-AND-PLAN.md b/local/docs/MULTITHREADING-COMPREHENSIVE-ASSESSMENT-AND-PLAN.md
new file mode 100644
index 0000000000..fe2908d4f8
--- /dev/null
+++ b/local/docs/MULTITHREADING-COMPREHENSIVE-ASSESSMENT-AND-PLAN.md
@@ -0,0 +1,720 @@
+# Red Bear OS — Multi-Threading Comprehensive Assessment and Implementation Plan
+
+**Date:** 2026-07-02
+**Scope:** Full-stack multi-threading audit: hardware/SMP, kernel scheduler, kernel futex, kernel syscall ABI, relibc pthreads, userspace threading correctness and performance
+**Status:** Authoritative — supersedes `archived/KERNEL-SCHEDULER-MULTITHREAD-IMPROVEMENT-PLAN.md` and `archived/SCHEDULER-REVIEW-FINAL.md` for all threading matters
+**Validation levels:** `builds` → `enumerates` → `usable` → `validated` → `hardware-validated`
+
+---
+
+## 1. Executive Summary
+
+### The Critical Finding — Lost Threading Work
+
+The P5–P9 scheduler and futex enhancement work (documented as "complete" in the archived
+plans) was **lost during the local fork migration** (2026-06). The local forks at
+`local/sources/kernel/` and `local/sources/relibc/` were created from **upstream Redox
+baselines** that did NOT include the Red Bear enhancement patches. The patches exist in
+`local/patches/kernel/` and `local/patches/relibc/` but are **not wired into the recipes**
+(both `recipe.toml` files use `path = "..."` with no `patches = [...]` list).
+
+**Impact:** The running kernel has:
+- Baseline DWRR scheduler only (no per-CPU queues, no work stealing, no load balancing, no vruntime, no RT scheduling, no cache-affine scheduling)
+- Baseline futex only (WAIT/WAIT64/WAKE — no sharding, no PI, no REQUEUE, no robust, no WAKE_OP, no BITSET)
+- relibc `sched_*` are all `todo!()`, `pthread_setschedparam` is a no-op, robust mutexes are `todo_skip!`, PI is absent
+
+**Recovery:** 13 of 18 kernel P5–P9 patches apply cleanly to the current fork. 5 fail due to
+patch-chain dependencies (they expect earlier patches applied first). The bulk of the work is
+recoverable by re-applying patches to the forks and committing them.
+
+### What Actually Works Today
+
+| Layer | Status | Detail |
+|-------|--------|--------|
+| **SMP boot** | ✅ Solid | INIT→SIPI sequence correct, per-CPU PCR via GS_BASE, x2APIC support |
+| **Context switching** | ✅ Solid | FPU/SIMD/AVX state save via XSAVE, FSBASE/GSBASE swap (FSGSBASE or MSR), correct callee-saved register save |
+| **TLB shootdown protocol** | ✅ Correct | AtomicBool flag + IPI + ack counter with `fence(SeqCst)` race prevention |
+| **Basic thread lifecycle** | ✅ Functional | pthread_create/join/detach/exit through proc scheme + redox_rt clone |
+| **Basic synchronization** | ✅ Functional | Futex-backed mutex, condvar, rwlock, barrier, spinlock, once |
+| **TLS** | ✅ Functional | ELF PT_TLS + pthread_key_create/getspecific/setspecific |
+| **Per-CPU data** | ✅ Functional | PercpuBlock via GS_BASE, all per-CPU state accessible |
+| **Signal delivery** | ✅ Functional | Shared-memory Sigcontrol pages, per-thread masks, trampoline |
+| **Scheduler algorithm** | 🚧 Basic DWRR | 40 priority levels, geometric weights, cooperative preemption (3-tick quantum) |
+| **Futex operations** | 🚧 Basic only | WAIT/WAIT64/WAKE with single global mutex |
+| **SMP load balancing** | ❌ Missing | No work stealing, no migration, contexts stuck on birth CPU |
+| **RT scheduling** | ❌ Missing | No SCHED_FIFO/SCHED_RR, no kernel policy dispatch |
+| **Futex REQUEUE** | ❌ Missing | Condvar broadcast causes thundering herd |
+| **Robust mutexes** | ❌ Missing | Thread death while holding mutex → permanent deadlock |
+| **PI futexes** | ❌ Missing | No priority inheritance → priority inversion risk |
+| **CPU affinity API** | ❌ Missing from relibc | Kernel supports sched_affinity field but no userspace API |
+| **Thread naming** | ❌ Missing from relibc | Kernel supports name field but no userspace API |
+| **Per-page TLB flush** | ❌ Missing | `invalidate_all()` = full CR3 reload on every shootdown |
+| **NUMA awareness** | ❌ Missing | No SRAT/SLIT, no proximity domains, flat memory model |
+| **IRQ balancing** | ❌ Missing | All legacy IRQs hardwired to BSP |
+
+---
+
+## 2. Layer-by-Layer Assessment
+
+### 2.1 Hardware / SMP Layer
+
+**Files:** `src/acpi/madt/arch/x86.rs`, `src/arch/x86_shared/start.rs`,
+`src/arch/x86_shared/device/local_apic.rs`, `src/arch/x86_shared/device/ioapic.rs`,
+`src/arch/x86_shared/ipi.rs`, `src/arch/x86_shared/interrupt/ipi.rs`, `src/percpu.rs`,
+`src/arch/x86_shared/gdt.rs`
+
+**Verdict: Functional foundation, performance gaps.**
+
+| Component | Status | Detail |
+|-----------|--------|--------|
+| AP boot (INIT/SIPI) | ✅ validated | Correct trampoline at 0x8000, per-AP PCR/IDT/stack allocation |
+| x2APIC mode | ✅ builds | Detected via CPUID, MSR-based access, APIC ID detection |
+| Per-CPU PCR via GS_BASE | ✅ validated | `PercpuBlock::current()` reads from PCR, SWAPGS protocol correct |
+| IPI send/receive | ✅ functional | 5 IPI kinds (Wakeup/Tlb/Switch/Pit/Profile), broadcast + unicast |
+| TLB shootdown | ✅ correct | AtomicBool + IPI + ack with `fence(SeqCst)` race prevention |
+| TLB granularity | ❌ coarse | Full CR3 reload (`mov cr3, cr3`) on every shootdown — no INVLPG |
+| TLB broadcast | 🚧 sequential | Iterates CPUs individually, doesn't use ICR "all excluding self" shorthand |
+| IRQ routing | ❌ BSP-only | Legacy I/O APIC entries hardcode `dest: bsp_apic_id` |
+| NUMA | ❌ absent | No SRAT/SLIT, no proximity domains |
+| SMT/HT topology | ❌ absent | No cache hierarchy, no hyperthread awareness |
+| Idle loop | ✅ functional | MWAIT with deepest C-state or HLT fallback |
+| W^X for trampoline | 🚧 minor | Trampoline page briefly W+X, unmapped after AP boot |
+
+### 2.2 Kernel Scheduler Layer
+
+**Files:** `src/context/switch.rs`, `src/context/mod.rs`, `src/context/context.rs`,
+`src/context/timeout.rs`
+
+**Verdict: Correct but primitive — DWRR only, no SMP balancing, no RT classes.**
+
+**Algorithm:** Deficit Weighted Round Robin (DWRR)
+- 40 priority levels, each a `VecDeque<WeakContextRef>`
+- Geometric weights: `SCHED_PRIO_TO_WEIGHT[i] ≈ 88761 / 1.25^i` (88761 → 15; each level weighs about 1.25× less than the one above it)
+- Per-CPU `balance` accumulator drives dequeue decisions
+- Quantum: 3 PIT ticks (~12.2ms) per scheduling round
+- Cooperative preemption: `preempt_locks > 0` disables preemption
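+
+The geometric weight table can be approximated with a short calculation. This is a minimal
+sketch, assuming each level weighs about 1.25 times less than the previous one, consistent
+with the 88761 → 15 endpoints quoted above. `approx_weight` is a hypothetical helper for
+illustration, not the kernel's `SCHED_PRIO_TO_WEIGHT` table.
+
+```rust
+/// Hypothetical helper: approximate weight for priority level `i` (0 = highest weight).
+/// Assumption: consecutive levels differ by a constant factor of 1.25.
+fn approx_weight(i: u32) -> f64 {
+    88761.0 / 1.25f64.powi(i as i32)
+}
+
+fn main() {
+    // Print the first, middle, and last of the 40 levels.
+    for i in [0, 20, 39] {
+        println!("level {:2}: weight ~ {:.0}", i, approx_weight(i));
+    }
+}
+```
+
+Under DWRR a context's share of CPU time is proportional to its weight, so the ratio between
+two levels matters more than the absolute values.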
+
+**Global locks:**
+- `RUN_CONTEXTS: Mutex<L1, RunContextData>` — all 40 priority queues under one L1 lock
+- `IDLE_CONTEXTS: Mutex<L2, VecDeque<WeakContextRef>>` — sleeping contexts
+- `CONTEXT_SWITCH_LOCK: AtomicBool` — global CAS spinlock serializing all context switches
+
+**What's missing (all of it was in the lost P5–P9 work):**
+
+| Gap | Lost Patch | Recoverable? |
+|-----|-----------|-------------|
+| Per-CPU run queues (eliminate global L1) | P6-percpu-runqueues, P8-percpu-sched, P8-percpu-wiring | ✅ applies cleanly |
+| Work stealing | P8-work-stealing | ❌ needs rebase (depends on per-CPU wiring) |
+| Initial placement (least-loaded CPU) | P8-initial-placement | ✅ applies cleanly |
+| Load balancing | P8-load-balance (absorbed) | needs verification |
+| Vruntime tracking + min-vruntime selection | P6-vruntime-switch | ✅ applies cleanly |
+| SchedPolicy enum (FIFO/RR/Other) | P5-sched-rt-policy | ✅ applies cleanly |
+| RT scheduling dispatch | P5-sched-rt-policy | ✅ applies cleanly |
+| Cache-affine scheduling | P7-cache-affine-switch | ✅ applies cleanly |
+| NUMA topology hints | P9-numa-topology | ✅ applies cleanly |
+
+### 2.3 Kernel Futex Layer
+
+**File:** `src/syscall/futex.rs`
+
+**Verdict: Baseline only — critical operations missing for desktop workloads.**
+
+| Operation | Status | Impact of Absence |
+|-----------|--------|-------------------|
+| `FUTEX_WAIT` (32-bit) | ✅ | — |
+| `FUTEX_WAIT64` (64-bit) | ✅ | — |
+| `FUTEX_WAKE` | ✅ | — |
+| `FUTEX_REQUEUE` | ❌ returns EINVAL | `pthread_cond_broadcast` wakes ALL waiters (thundering herd) |
+| `FUTEX_CMP_REQUEUE` | ❌ not defined | Same + atomicity gap |
+| `FUTEX_WAKE_OP` | ❌ not defined | Combined modify-and-wake (used by older glibc condvars) unavailable |
+| `FUTEX_WAIT_BITSET` | ❌ not defined | Bitmask-filtered waits and absolute-timeout waits unavailable |
+| `FUTEX_WAKE_BITSET` | ❌ not defined | Targeted wake unavailable |
+| `FUTEX_LOCK_PI` / `UNLOCK_PI` | ❌ not defined | Priority inversion unprotected |
+| Robust futex list | ❌ not defined | Thread death → permanent deadlock |
+| Futex sharding (per-futex lock) | ❌ single global L1 mutex | All futex ops on all CPUs contend on one lock |
+| Process-private futexes | ❌ global table | Unnecessary cross-process visibility |
+
+**Architecture:**
+```
+static FUTEXES: Mutex<L1, FutexList>  // single global lock
+type FutexList = HashMap<PhysicalAddress, Vec<FutexEntry>>
+```
+
+The physical address is the key, which enables cross-address-space futexes via MAP_SHARED.
+The virtual address plus a `Weak<AddrSpaceWrapper>` is used for CoW disambiguation.
+
+**Recoverable work (lost patches):**
+
+| Feature | Lost Patch | Applies? |
+|---------|-----------|----------|
+| 64-shard hash table | P6-futex-sharding | ✅ cleanly |
+| FUTEX_REQUEUE + CMP_REQUEUE | P8-futex-requeue | ❌ needs rebase |
+| PI futex (LOCK_PI/UNLOCK_PI/TRYLOCK_PI) | P8-futex-pi | ❌ needs rebase |
+| PI CAS fix | P9-futex-pi-cas-fix | ❌ needs rebase |
+| Robust futex list | P8-futex-robust | ❌ needs rebase |
+
+The 4 failing patches likely fail because they depend on sharding (P6-futex-sharding) being
+applied first. Apply in order: P6-sharding → P8-requeue → P8-pi → P8-robust → P9-pi-cas-fix.
+
+### 2.4 Kernel Syscall ABI Layer
+
+**Files:** `src/syscall/mod.rs`, `src/syscall/futex.rs`, `src/syscall/time.rs`,
+`src/syscall/process.rs`, `local/sources/syscall/src/number.rs`, `src/scheme/proc.rs`
+
+**Verdict: Minimal surface — most threading done via proc scheme, not syscalls.**
+
+The kernel defines only ~35 syscall numbers. Threading-relevant ones:
+
+| Syscall | Status | Notes |
+|---------|--------|-------|
+| `SYS_FUTEX` (240) | ✅ partial | WAIT/WAIT64/WAKE only |
+| `SYS_YIELD` (158) | ✅ | `context::switch()` + signal handler |
+| `SYS_FMAP` (900) | ✅ | Anonymous + file-backed mmap |
+| `SYS_FUNMAP` (92) | ✅ | munmap |
+| `SYS_MPROTECT` (125) | ✅ | |
+| `SYS_MREMAP` (155) | ✅ | |
+| `SYS_NANOSLEEP` (162) | ✅ | EINTR-aware |
+| `SYS_CLOCK_GETTIME` (265) | ✅ partial | REALTIME + MONOTONIC only |
+
+**Threading done via proc scheme (not syscalls):**
+
+| Operation | Mechanism |
+|-----------|-----------|
+| Thread/process creation | `proc:` scheme: open "new-context", share addr_space + files via kdup |
+| waitpid | `proc:` scheme: `EVENT_READ` on context fd |
+| getpid/gettid | `proc:` scheme: read "attrs" handle |
+| kill/tkill | `proc:` scheme: `ForceKill` / `Interrupt` ContextVerb |
+| CPU affinity | `proc:` scheme: write "sched-affinity" handle |
+| Priority | `proc:` scheme: write "attrs" prio field |
+| Signal setup | `proc:` scheme: write "sighandler" + shared Sigcontrol pages |
+| TLS base (FSBASE) | `proc:` scheme: write "regs/env" EnvRegisters |
+
+**Completely missing syscalls (no number, no handler):**
+`clone`, `fork`, `vfork`, `waitpid`, `wait4`, `kill`, `tkill`, `tgkill`, `arch_prctl`,
+`set_thread_area`, `set_tid_address`, `set_robust_list`, `get_robust_list`,
+`sched_setaffinity`, `sched_getaffinity`, `sched_setscheduler`, `sched_getparam`,
+`sigaction`, `sigprocmask`, `sigpending`, `sigsuspend`, `sigtimedwait`,
+`timer_create`, `timer_settime`, `timer_delete`, `timerfd_create`,
+`getrusage`, `setrlimit`, `getrlimit`, `times`
+
+### 2.5 relibc Pthread Layer
+
+**Files:** `src/pthread/mod.rs`, `src/sync/*.rs`, `src/header/pthread/*.rs`,
+`src/header/sched/mod.rs`, `src/ld_so/tcb.rs`, `src/platform/redox/mod.rs`
+
+**Verdict: Core pthreads solid, scheduling/robust/PI absent, several POSIX gaps.**
+
+#### Fully Working (futex-backed)
+
+| API Group | Backend | Notes |
+|-----------|---------|-------|
+| `pthread_create/join/detach/exit` | redox_rt clone + Waitval | Stack via mmap, TLS via Tcb::new() |
+| `pthread_cancel/setcancelstate/testcancel` | SIGRT_RLCT_CANCEL (33) | Deferred cancellation only |
+| `pthread_mutex_*` (normal/recursive/errorcheck) | AtomicU32 CAS + futex_wait/wake | 3-state: unlocked/locked/waiters |
+| `pthread_cond_*` | Two-counter futex design | CLOCK_REALTIME only (monotonic = stub) |
+| `pthread_rwlock_*` | AtomicU32 + futex | Reader count + WAITING_WR bit |
+| `pthread_barrier_*` | Mutex + Cond | gen_id wrapping counter |
+| `pthread_spin_*` | AtomicI32 CAS | No futex, pure spinning |
+| `pthread_once` | 3-state futex (UNINIT→INITING→INIT) | |
+| `pthread_key_create/getspecific/setspecific/delete` | BTreeMap global + thread_local values | Destructor iteration per POSIX |
+| `pthread_sigmask` | Delegates to sigprocmask | |
+| `pthread_kill` | redox_rt::rlct_kill | |
+| `pthread_atfork` | Thread-local LinkedList hooks | |
+| ELF TLS (`__thread` / `#[thread_local]`) | PT_TLS + Tcb | Static + dynamic DTV for dlopen |
+| `pthread_attr_*` (getters/setters) | RlctAttr struct | |
+
+#### Stubs / No-ops / Missing
+
+| API | Status | Root Cause |
+|-----|--------|------------|
+| `sched_get_priority_max/min` | `todo!()` | Kernel has no scheduling policy API |
+| `sched_getparam/setparam` | `todo!()` | Same |
+| `sched_setscheduler` | `todo!()` | Same |
+| `sched_rr_get_interval` | `todo!()` | Same |
+| `pthread_setschedparam` | No-op (returns Ok) | Kernel ignores policy |
+| `pthread_setschedprio` | No-op (returns Ok) | Kernel ignores priority change |
+| `pthread_getschedparam` | `todo!()` | |
+| `pthread_getcpuclockid` | ENOENT | No per-thread CPU clock |
+| `pthread_mutex_consistent` | `todo_skip!` | Robust mutex not implemented |
+| `pthread_mutex_getprioceiling` | `todo_skip!` | Priority ceiling not implemented |
+| `pthread_mutex_setprioceiling` | `todo_skip!` | Same |
+| `pthread_mutexattr_setprotocol` (PRIO_INHERIT) | Accepted, no-op | PI futex missing |
+| `pthread_mutexattr_setrobust` (ROBUST) | Accepted, no-op | Robust futex missing |
+| `pthread_cond_init` CLOCK_MONOTONIC | `todo_skip!` | |
+| `pthread_cond_signal` | Calls broadcast (wakes ALL) | Missing FUTEX_REQUEUE optimization |
+| `pthread_setaffinity_np` | Not defined | |
+| `pthread_getaffinity_np` | Not defined | |
+| `pthread_setname_np` | Not defined | |
+| `pthread_getname_np` | Not defined | |
+| `pthread_setcanceltype` | Always returns DEFERRED | ASYNC not tracked |
+| Guard pages | Attribute stored, not mapped | No PROT_NONE page before stack |
+| PTHREAD_KEYS_MAX limit | Not checked | |
+
+---
+
+## 3. Gap Classification
+
+### 3.1 Correctness Gaps (Must Fix — Silent Data Corruption or Deadlock)
+
+| # | Gap | Impact | Fix Location |
+|---|-----|--------|-------------|
+| C1 | **No robust mutexes** | Thread death while holding mutex → permanent deadlock for all waiters | Kernel: robust futex list + relibc: pthread_mutex_consistent |
+| C2 | **No PI futexes** | Priority inversion: low-prio thread blocks high-prio thread indefinitely | Kernel: FUTEX_LOCK_PI/UNLOCK_PI + relibc: mutexattr_setprotocol |
+| C3 | **`pthread_cond_signal` wakes ALL** | Every signal wakes every waiter instead of one: wasted CPU and a thundering herd on each call | relibc: use true wake(1) (may need FUTEX_REQUEUE) |
+| C4 | **`fork()` not thread-safe** | `pthread_atfork` hooks exist but child inherits locked mutexes | relibc: implement atfork child handlers properly |
+
+### 3.2 Performance Gaps (Must Fix for Desktop Responsiveness)
+
+| # | Gap | Impact | Fix Location |
+|---|-----|--------|-------------|
+| P1 | **No SMP load balancing** | Cores sit idle while others are overloaded | Kernel: work stealing + initial placement |
+| P2 | **No futex sharding** | Single global L1 mutex for ALL futex ops on ALL CPUs | Kernel: 64-shard hash table |
+| P3 | **No FUTEX_REQUEUE** | `pthread_cond_broadcast` wakes all → thundering herd | Kernel: REQUEUE + CMP_REQUEUE |
+| P4 | **Full TLB flush on every shootdown** | Per-page mprotect/munmap flushes entire TLB on all cores | Kernel: INVLPG-based selective flush |
+| P5 | **Global context switch lock** | Serialization bottleneck beyond ~8 cores | Kernel: per-CPU context switch (needs per-CPU run queues) |
+| P6 | **All IRQs to BSP** | CPU 0 handles all interrupts, cache thrash, latency | Kernel: IRQ steering in I/O APIC + MSI/MSI-X dest field |
+| P7 | **No RT scheduling** | Audio/compositor threads can't get priority | Kernel: SchedPolicy + RT dispatch + relibc: sched_setscheduler |
+
+### 3.3 POSIX Completeness Gaps (Must Fix for Application Compatibility)
+
+| # | Gap | Impact | Fix Location |
+|---|-----|--------|-------------|
+| X1 | `sched_*` all `todo!()` | Applications calling sched_setscheduler panic | relibc: implement via proc scheme |
+| X2 | `pthread_setschedparam` no-op | Apps can't change thread priority | relibc: wire to proc scheme prio write |
+| X3 | `pthread_setaffinity_np` missing | Apps can't pin threads to CPUs | relibc: implement via proc scheme affinity write |
+| X4 | `pthread_setname_np` missing | Debugging harder (no thread names in /proc) | relibc: implement via proc scheme name write |
+| X5 | `pthread_getcpuclockid` ENOENT | Per-thread profiling impossible | relibc + kernel: expose cpu_time via clock |
+| X6 | Guard pages not mapped | Stack overflow → silent corruption, no SIGSEGV | relibc: mmap PROT_NONE guard page in pthread_create |
+| X7 | `pthread_cond_init` monotonic stub | CLOCK_MONOTONIC condvars use REALTIME (affected by wall clock jumps) | relibc: implement monotonic condvar |
+
+---
+
+## 4. Implementation Plan
+
+### Phase 0: Patch Recovery — Re-Apply Lost Threading Work (Week 1–2)
+
+**Goal:** Recover the P5–P9 work that was lost during the local fork migration.
+
+**This is the highest-priority phase — it restores ~6 months of work with minimal new code.**
+
+#### 0.1 — Re-apply kernel scheduler patches to local fork
+
+Apply in dependency order to `local/sources/kernel/`:
+
+| Order | Patch | Status | Action |
+|-------|-------|--------|--------|
+| 1 | P6-futex-sharding | ✅ applies | Commit directly |
+| 2 | P6-percpu-runqueues | ✅ applies | Commit directly |
+| 3 | P8-percpu-sched | ✅ applies | Commit directly |
+| 4 | P8-percpu-wiring | ✅ applies | Commit directly |
+| 5 | P8-initial-placement | ✅ applies | Commit directly |
+| 6 | P5-sched-rt-policy | ✅ applies | Commit directly |
+| 7 | P5-context-mod-sched | ✅ applies | Commit directly |
+| 8 | P6-vruntime-switch | ✅ applies | Commit directly |
+| 9 | P7-cache-affine-switch | ✅ applies | Commit directly |
+| 10 | P9-numa-topology | ✅ applies | Commit directly |
+| 11 | P9-proc-lock-ordering | ✅ applies | Commit directly |
+| 12 | P8-work-stealing | ❌ needs rebase | Rebase against 1–11, then apply |
+| 13 | P8-futex-requeue | ❌ needs rebase | Rebase against P6-sharding (#1), then apply |
+| 14 | P8-futex-pi | ❌ needs rebase | Rebase against #13, then apply |
+| 15 | P8-futex-robust | ❌ needs rebase | Rebase against #14, then apply |
+| 16 | P9-futex-pi-cas-fix | ❌ needs rebase | Rebase against #14, then apply |
+| 17 | P7-scheduler-improvements | ❌ needs rebase | Rebase against 1–11, then apply |
+
+**Verification after each patch:**
+```bash
+cd local/sources/kernel
+cargo check  # must pass
+```
+
+#### 0.2 — Re-apply relibc threading patches to local fork
+
+Apply to `local/sources/relibc/`:
+
+| Patch | Action |
+|-------|--------|
+| P3-threads.patch | ✅ applies — commit |
+| P3-barrier-smp-futex (from absorbed/) | Verify already in fork; if not, apply |
+| P3-pthread-signal-races (from absorbed/) | Verify already in fork |
+| P3-pthread-yield (from absorbed/) | Verify already in fork |
+| P5-robust-mutexes (from absorbed/) | Verify; re-apply if missing |
+| P5-robust-mutex-enotrec-fix (from absorbed/) | Same |
+| P5-sched-api (from absorbed/) | Same |
+| P7-pthread-affinity (from absorbed/) | Same |
+| P7-pthread-setname (from absorbed/) | Same |
+| P7-setpriority (from absorbed/) | Same |
+| P9-spin-and-barrier (from absorbed/) | Same |
+| P9-spin-fix (from absorbed/) | Same |
+| P3-semaphore-comprehensive | ✅ applies |
+
+**Verification:**
+```bash
+cd local/sources/relibc
+make all  # must pass
+touch relibc && make prefix  # rebuild prefix with new libc
+```
+
+#### 0.3 — Build and smoke test
+
+```bash
+export REDBEAR_ALLOW_PROTECTED_FETCH=1
+./local/scripts/build-redbear.sh --upstream redbear-mini
+make qemu  # verify boot + basic operation
+```
+
+**Success criteria:** redbear-mini boots, multi-threaded daemons (pcid, xhcid) start, no kernel panic.
+
+---
+
+### Phase 1: Futex Completeness (Week 2–4)
+
+**Goal:** Close the futex operation gaps that affect correctness and performance.
+
+**Depends on:** Phase 0 complete (sharding applied first).
+
+#### 1.1 — FUTEX_REQUEUE + FUTEX_CMP_REQUEUE
+
+**Kernel:** `src/syscall/futex.rs`
+- Add `FUTEX_REQUEUE` and `FUTEX_CMP_REQUEUE` to the futex dispatcher
+- Implement: move up to `val` waiters from addr1 → addr2, optionally compare `*addr1 == val2`
+- Requires locking TWO shards (acquire both in deterministic order to avoid deadlock)
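+
+The two-shard locking rule above can be sketched in userspace Rust. This is an illustration
+under stated assumptions, not the kernel implementation: `shard_of`, `Shard`, and `requeue`
+are hypothetical names, `std::sync::Mutex` stands in for the kernel's ordered mutexes, and
+waiters are plain integer ids.
+
+```rust
+use std::sync::Mutex;
+
+const SHARDS: usize = 64; // shard count taken from the P6-futex-sharding description
+
+/// Hypothetical shard selector: futex words are 4-byte aligned, so drop the low bits.
+fn shard_of(phys_addr: usize) -> usize {
+    (phys_addr >> 2) % SHARDS
+}
+
+/// Each shard holds (futex address, waiter id) pairs.
+type Shard = Vec<(usize, u32)>;
+
+/// Move up to `max` waiters queued on `src` so that they wait on `dst` instead.
+fn requeue(table: &[Mutex<Shard>], src: usize, dst: usize, max: usize) -> usize {
+    let (s, d) = (shard_of(src), shard_of(dst));
+    let mut moved = 0;
+    if s == d {
+        // Same shard: one lock is enough, just retarget the entries.
+        let mut shard = table[s].lock().unwrap();
+        for entry in shard.iter_mut() {
+            if moved == max {
+                break;
+            }
+            if entry.0 == src {
+                entry.0 = dst;
+                moved += 1;
+            }
+        }
+        return moved;
+    }
+    // Different shards: always take the lower index first, so two concurrent
+    // requeues in opposite directions cannot deadlock.
+    let mut first = table[s.min(d)].lock().unwrap();
+    let mut second = table[s.max(d)].lock().unwrap();
+    let (from, to) = if s < d {
+        (&mut *first, &mut *second)
+    } else {
+        (&mut *second, &mut *first)
+    };
+    let mut i = 0;
+    while i < from.len() && moved < max {
+        if from[i].0 == src {
+            let (_, waiter) = from.remove(i);
+            to.push((dst, waiter));
+            moved += 1;
+        } else {
+            i += 1;
+        }
+    }
+    moved
+}
+
+fn main() {
+    let table: Vec<Mutex<Shard>> = (0..SHARDS).map(|_| Mutex::new(Vec::new())).collect();
+    let (cond, mutex) = (0x1000, 0x2004); // example addresses in different shards
+    for id in 0..3 {
+        table[shard_of(cond)].lock().unwrap().push((cond, id));
+    }
+    let moved = requeue(&table, cond, mutex, 2);
+    println!("requeued {moved} waiters");
+}
+```
+
+Ordering by shard index rather than by address keeps the rule valid even when two different
+addresses hash to the same shard.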
+
+**relibc:** `src/sync/cond.rs`
+- Change `pthread_cond_broadcast` to use `FUTEX_REQUEUE` (move waiters from condvar futex to mutex futex)
+- Change `pthread_cond_signal` to wake exactly 1 (not all)
+
+**Impact:** Eliminates thundering herd on every `pthread_cond_broadcast`. Major win for Qt event loop, KWin compositor, Mesa worker threads.
+
+#### 1.2 — PI Futexes (FUTEX_LOCK_PI / FUTEX_UNLOCK_PI / FUTEX_TRYLOCK_PI)
+
+**Kernel:** `src/syscall/futex.rs`
+- Add `PiState` tracking per futex: owner context + waiter list with priorities
+- On `LOCK_PI` block: boost owner's priority to waiter's priority
+- On `UNLOCK_PI`: restore original priority, wake highest-priority waiter
+- Requires kernel RT scheduling (Phase 0.1 #6–7: P5-sched-rt-policy)
+
+**relibc:** `src/sync/pthread_mutex.rs`
+- Implement `PTHREAD_PRIO_INHERIT` protocol path using PI futex
+- Replace `todo_skip!` in `pthread_mutex_consistent` with real implementation
+
+#### 1.3 — Robust Futex List
+
+**Kernel:** `src/syscall/futex.rs` + `src/context/context.rs`
+- Add `robust_list_head: Option<usize>` to `Context` struct
+- Implement `set_robust_list` / `get_robust_list` via proc scheme or syscall
+- On thread exit (`exit_this_context`): walk robust list, set `FUTEX_OWNER_DIED` bit, wake one waiter with `EOWNERDEAD`
+
+**relibc:** `src/sync/pthread_mutex.rs`
+- Implement robust list registration in `pthread_mutex_lock`
+- Implement `pthread_mutex_consistent`: clear `EOWNERDEAD` state
+- Replace `todo_skip!` with real implementation
+
+#### 1.4 — FUTEX_WAKE_OP
+
+**Kernel:** `src/syscall/futex.rs`
+- Implement atomic op + wake: atomically apply op to `*addr2`, wake up to `val` waiters on addr1, and, if the comparison against the old `*addr2` value holds, also wake waiters on addr2
+- Operations: set, add, or, andn, xor, with comparison condition
+
+**Impact:** Saves a syscall in combined modify-and-wake paths (older glibc used it in `pthread_cond_signal`). Not critical for relibc but helps ported glibc-linked binaries.
+
+---
+
+### Phase 2: SMP Scheduling Quality (Week 3–6)
+
+**Goal:** Make multi-core actually distribute work.
+
+**Depends on:** Phase 0 complete (per-CPU queues applied).
+
+#### 2.1 — Work stealing (recover + fix)
+
+**Kernel:** `src/context/switch.rs`
+- On `select_next_context()` empty local queue: steal from victim CPU
+- Pick victim by round-robin, steal highest-priority runnable context
+- Limit steal batch size (1–2 contexts per steal attempt)
+- Send `IpiKind::Wakeup` to target CPU if stealing woke it from idle
+
+**Recovery:** P8-work-stealing needs rebase against per-CPU wiring.
+
+#### 2.2 — Load balancing (recover + verify)
+
+**Kernel:** `src/context/switch.rs`
+- Periodic balance trigger (every N ticks or when queue depth difference > threshold)
+- Migrate contexts from overloaded CPU to most-idle CPU
+- Respect `sched_affinity` mask during migration
+
+**Recovery:** P8-load-balance is in absorbed/ — verify it's in the fork after Phase 0.
+
+#### 2.3 — Reschedule IPI
+
+**Kernel:** `src/arch/x86_shared/ipi.rs` + `src/context/switch.rs`
+- When waking a context on a different CPU, send `IpiKind::Switch` to that CPU
+- Currently the Switch IPI exists but is not used by the scheduler
+
+#### 2.4 — Per-page TLB flush (INVLPG)
+
+**Kernel:** `rmm/src/arch/x86_64.rs` + `src/context/memory.rs`
+- Add `invalidate_page(addr)` using `invlpg` instruction
+- Modify `Flusher` to track individual pages and use INVLPG when ≤ N pages affected
+- Fall back to CR3 reload only for large-scale invalidations
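+
+The threshold decision described above can be sketched as a small state machine. This is a
+hypothetical illustration only: `PendingFlush`, `FlushPlan`, and the threshold value of 32
+are assumptions, not the kernel's `Flusher` API, and the sketch only decides what to flush
+(the privileged `invlpg` and CR3 reload themselves are left out).
+
+```rust
+/// Assumed threshold N: beyond this many pages a full flush is taken to be cheaper.
+const INVLPG_MAX_PAGES: usize = 32;
+
+#[derive(Debug, PartialEq)]
+enum FlushPlan {
+    Nothing,
+    Pages(Vec<usize>), // issue one INVLPG per recorded page address
+    Full,              // fall back to a CR3 reload
+}
+
+#[derive(Default)]
+struct PendingFlush {
+    pages: Vec<usize>,
+    overflowed: bool,
+}
+
+impl PendingFlush {
+    /// Record one invalidated page; switch to a full flush once the threshold is exceeded.
+    fn record(&mut self, page_addr: usize) {
+        if self.overflowed {
+            return;
+        }
+        if self.pages.len() == INVLPG_MAX_PAGES {
+            self.pages.clear();
+            self.overflowed = true;
+        } else {
+            self.pages.push(page_addr);
+        }
+    }
+
+    /// Decide how the accumulated invalidations should be carried out.
+    fn plan(self) -> FlushPlan {
+        if self.overflowed {
+            FlushPlan::Full
+        } else if self.pages.is_empty() {
+            FlushPlan::Nothing
+        } else {
+            FlushPlan::Pages(self.pages)
+        }
+    }
+}
+
+fn main() {
+    let mut pending = PendingFlush::default();
+    for page in [0x1000, 0x2000, 0x3000] {
+        pending.record(page);
+    }
+    println!("{:?}", pending.plan());
+}
+```
+
+The right value for the threshold depends on the hardware and would need to be measured.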
+
+**Impact:** Every `mprotect`/`mmap`/`munmap` on a multi-threaded process currently flushes the ENTIRE TLB on every core. This is one of the most impactful single fixes.
+
+#### 2.5 — TLB broadcast optimization
+
+**Kernel:** `src/percpu.rs`
+- Replace per-CPU sequential `shootdown_tlb_ipi(Some(id))` loop with ICR "all excluding self" (destination shorthand 0b11)
+- Single IPI + global ack counter instead of N individual IPIs + N ack counters
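+
+For reference, the shorthand lives in bits 18–19 of the low ICR dword. The sketch below
+builds such a value for a fixed-delivery IPI. The function name and the example vector are
+assumptions for illustration, and the actual write to the ICR (MMIO in xAPIC mode, an MSR in
+x2APIC mode) is left out.
+
+```rust
+/// Bit 14: level = assert (required for fixed-delivery IPIs).
+const ICR_LEVEL_ASSERT: u32 = 1 << 14;
+/// Bits 18-19: destination shorthand 0b11 = all excluding self.
+const ICR_SHORTHAND_ALL_EXCLUDING_SELF: u32 = 0b11 << 18;
+
+/// Hypothetical helper: low ICR dword for a fixed-delivery broadcast IPI.
+/// Delivery mode "fixed" is 0b000 in bits 8-10, so it contributes no set bits.
+fn icr_low_all_excluding_self(vector: u8) -> u32 {
+    vector as u32 | ICR_LEVEL_ASSERT | ICR_SHORTHAND_ALL_EXCLUDING_SELF
+}
+
+fn main() {
+    let vector = 0x41; // example vector only; use the kernel's real TLB IPI vector
+    println!("ICR low = {:#010x}", icr_low_all_excluding_self(vector));
+}
+```
+
+With a shorthand, the destination field of the ICR is ignored, so the sender does not need
+to know any target APIC IDs.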
+
+---
+
+### Phase 3: RT Scheduling (Week 4–6)
+
+**Goal:** Allow applications to request real-time scheduling for latency-sensitive threads.
+
+**Depends on:** Phase 0 (SchedPolicy and per-CPU queues applied).
+
+#### 3.1 — Kernel RT scheduling dispatch
+
+**Kernel:** `src/context/switch.rs` (from P5-sched-rt-policy — recovered in Phase 0)
+- `select_next_context()` scans candidates in three passes:
+  1. SCHED_FIFO contexts (highest RT priority first, no preemption within same prio)
+  2. SCHED_RR contexts (highest RT priority first, round-robin within same prio)
+  3. SCHED_OTHER contexts (existing DWRR/vruntime)
+- SCHED_RR quantum: configurable per-context (default 100ms)
+
+#### 3.2 — relibc sched_* API completion
+
+**relibc:** `src/header/sched/mod.rs`
+
+Replace ALL `todo!()` stubs:
+
+| Function | Implementation |
+|----------|---------------|
+| `sched_getscheduler(pid)` | Read policy from proc scheme attrs |
+| `sched_setscheduler(pid, policy, param)` | Write policy + RT priority via proc scheme |
+| `sched_getparam(pid, param)` | Read RT priority from proc scheme |
+| `sched_setparam(pid, param)` | Write RT priority via proc scheme |
+| `sched_get_priority_max(policy)` | Return 99 for FIFO/RR, 0 for OTHER |
+| `sched_get_priority_min(policy)` | Return 1 for FIFO/RR, 0 for OTHER |
+| `sched_rr_get_interval(pid, tp)` | Return SCHED_RR quantum (100ms default) |
+
+#### 3.3 — pthread_setschedparam wiring
+
+**relibc:** `src/pthread/mod.rs`
+- Replace `set_sched_param` no-op with real proc scheme call
+- Replace `set_sched_priority` no-op with real proc scheme call
+
+---
+
+### Phase 4: POSIX Pthread Completeness (Week 5–8)
+
+**Goal:** Close remaining POSIX gaps that block application compatibility.
+
+**Depends on:** Phase 0 + Phase 3 (for sched API).
+
+#### 4.1 — pthread_setaffinity_np / pthread_getaffinity_np
+
+**relibc:** `src/header/pthread/mod.rs` + `src/header/sched/mod.rs`
+- Implement using proc scheme "sched-affinity" write/read
+- Define `cpu_set_t` type and `CPU_SET/CPU_CLR/CPU_ZERO/CPU_ISSET` macros
+
+#### 4.2 — pthread_setname_np / pthread_getname_np
+
+**relibc:** `src/header/pthread/mod.rs`
+- Implement using proc scheme name write/read (kernel already supports 32-char name field)
+
+#### 4.3 — pthread_cond_init CLOCK_MONOTONIC
+
+**relibc:** `src/sync/cond.rs`
+- Replace `todo_skip!` with real monotonic clock support
+- Store clock choice in cond struct, use `CLOCK_MONOTONIC` for deadline calculations
+
+#### 4.4 — Guard pages
+
+**relibc:** `src/pthread/mod.rs`
+- In `pthread_create`, when allocating stack via mmap:
+  - Map `[stack_base, stack_base + guard_size)` with `PROT_NONE`
+  - Map `[stack_base + guard_size, stack_base + guard_size + stack_size)` with `PROT_READ | PROT_WRITE`
+- On thread exit, munmap both regions
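+
+The address arithmetic for the two regions can be sketched as follows. This is a minimal
+sketch, assuming 4 KiB pages and sizes rounded up to whole pages. `StackLayout` and `layout`
+are hypothetical names, and the actual `mmap`/`mprotect` calls are omitted.
+
+```rust
+use std::ops::Range;
+
+const PAGE_SIZE: usize = 4096; // assumption: 4 KiB pages
+
+/// Round `n` up to the next multiple of the page size.
+fn round_up(n: usize) -> usize {
+    (n + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE
+}
+
+/// Hypothetical description of one thread stack mapping.
+struct StackLayout {
+    guard: Range<usize>, // to be mapped PROT_NONE
+    stack: Range<usize>, // to be mapped PROT_READ | PROT_WRITE
+}
+
+/// Place the guard region directly below the usable stack.
+fn layout(stack_base: usize, guard_size: usize, stack_size: usize) -> StackLayout {
+    let guard_end = stack_base + round_up(guard_size);
+    StackLayout {
+        guard: stack_base..guard_end,
+        stack: guard_end..guard_end + round_up(stack_size),
+    }
+}
+
+fn main() {
+    let l = layout(0x10000, 4096, 8192);
+    println!("guard {:#x?}, stack {:#x?}", l.guard, l.stack);
+}
+```
+
+Because the stack grows downward on x86, an overflow runs off the low end of the usable
+region and lands in the guard range, which is what turns it into a fault.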
+
+#### 4.5 — pthread_getcpuclockid
+
+**relibc:** `src/header/pthread/mod.rs`
+- Return `CLOCK_THREAD_CPUTIME_ID` (requires kernel support — add clock to `clock_gettime`)
+
+**Kernel:** `src/syscall/time.rs`
+- Add `CLOCK_THREAD_CPUTIME_ID` → read `context.cpu_time`
+
+#### 4.6 — PTHREAD_KEYS_MAX enforcement
+
+**relibc:** `src/header/pthread/tls.rs`
+- Check `NEXTKEY` against `PTHREAD_KEYS_MAX` (1024) before allocating
+
+---
+
+### Phase 5: IRQ Steering and NUMA (Week 8–12)
+
+**Goal:** Distribute interrupt load and respect memory locality.
+
+**Depends on:** Phase 2 (per-CPU infrastructure).
+
+#### 5.1 — IRQ steering
+
+**Kernel:** `src/arch/x86_shared/device/ioapic.rs` + `src/arch/x86_shared/idt.rs`
+- Change I/O APIC redirection `dest` from `bsp_apic_id` to round-robin or RSS hash
+- Add per-CPU legacy IRQ handlers in IDT (not just BSP)
+- For MSI/MSI-X: set destination CPU in Message Address register
+
+#### 5.2 — NUMA topology discovery
+
+**Kernel:** `src/acpi/` (from P9-numa-topology — recovered in Phase 0)
+- Parse SRAT (Static Resource Affinity Table) for proximity domains
+- Parse SLIT (System Locality Distance Information Table) for inter-node distances
+- Store `NumaTopology` in kernel for O(1) scheduling lookups
+
+#### 5.3 — NUMA-aware memory allocation
+
+**Kernel:** `src/memory/` + frame allocator
+- Track frame NUMA node in `Frame` or `PageInfo`
+- On allocation, prefer frames from requesting CPU's NUMA node
+- Fallback to remote node when local node is exhausted
+
+---
+
+## 5. Dependency Chain
+
+```
+Phase 0 (Patch Recovery) ← BLOCKING FOR ALL OTHERS
+    │
+    ├──► Phase 1 (Futex Completeness)
+    │       │
+    │       ├──► 1.1 REQUEUE ──► condvar performance
+    │       ├──► 1.2 PI ──► priority inversion fix (needs Phase 3.1)
+    │       ├──► 1.3 Robust ──► deadlock prevention
+    │       └──► 1.4 WAKE_OP ──► glibc compat
+    │
+    ├──► Phase 2 (SMP Scheduling)
+    │       │
+    │       ├──► 2.1 Work stealing ──► core utilization
+    │       ├──► 2.2 Load balancing ──► fair distribution
+    │       ├──► 2.3 Reschedule IPI ──► cross-CPU wakeup
+    │       ├──► 2.4 Per-page TLB ──► mmap/mprotect performance
+    │       └──► 2.5 TLB broadcast ──► IPI efficiency
+    │
+    ├──► Phase 3 (RT Scheduling)
+    │       │
+    │       ├──► 3.1 Kernel RT dispatch (from Phase 0)
+    │       ├──► 3.2 relibc sched_* API ──► POSIX compat
+    │       └──► 3.3 pthread_setschedparam ──► app priority control
+    │
+    ├──► Phase 4 (POSIX Pthread Completeness)
+    │       │
+    │       ├──► 4.1 Affinity API ──► CPU pinning
+    │       ├──► 4.2 Thread naming ──► debuggability
+    │       ├──► 4.3 Monotonic condvar ──► clock correctness
+    │       ├──► 4.4 Guard pages ──► stack overflow detection
+    │       ├──► 4.5 CPU clock ──► per-thread profiling
+    │       └──► 4.6 Keys max ──► resource limit
+    │
+    └──► Phase 5 (IRQ + NUMA)
+            │
+            ├──► 5.1 IRQ steering ──► interrupt distribution
+            ├──► 5.2 NUMA topology ──► (from Phase 0)
+            └──► 5.3 NUMA allocator ──► memory locality
+```
+
+**Parallel work possible:**
+- Phases 1, 2, and 3 can run in parallel after Phase 0, except item 1.2 (PI futex), which needs Phase 3.1
+- Phase 4 items are independent of each other
+- Phase 5 depends on Phase 2 but not on Phase 1/3/4
+
+---
+
+## 6. Validation Plan
+
+### 6.1 Build Evidence
+
+| Check | Command |
+|-------|---------|
+| Kernel compiles | `make r.kernel` |
+| relibc compiles | `make r.relibc` |
+| Prefix rebuilt | `touch relibc kernel && make prefix` |
+| Full OS builds | `make all CONFIG_NAME=redbear-mini` |
+
+### 6.2 Runtime Evidence (QEMU)
+
+| Test | Verification |
+|------|-------------|
+| Multi-threaded boot | `make qemu QEMUFLAGS="-smp 4"` — all 4 CPUs active |
+| pthread smoke test | Guest: compile + run simple pthread_create/join/mutex test |
+| Work stealing | Guest: spawn 8 threads on 4-CPU QEMU, verify all CPUs utilized |
+| Futex REQUEUE | Guest: condvar broadcast benchmark — waiters wake in ≤2 batches, not N |
+| PI futex | Guest: priority inversion test — high-prio thread unblocked within 1 tick |
+| Robust mutex | Guest: kill thread holding mutex, verify EOWNERDEAD recovery |
+| RT scheduling | Guest: SCHED_FIFO thread preempts SCHED_OTHER within 100μs |
+| CPU affinity | Guest: pin thread to CPU 1, verify it never runs on CPU 0 |
+| Thread naming | Guest: `cat /scheme/proc/*/name` shows set names |
+| Guard pages | Guest: overflow stack, verify SIGSEGV (not silent corruption) |
+| TLB efficiency | Guest: mprotect benchmark — compare TLB miss rate before/after |
+
+### 6.3 Validation Scripts (to create)
+
+```bash
+local/scripts/test-threading-qemu.sh          # Comprehensive threading smoke test
+local/scripts/test-futex-requeue-qemu.sh      # REQUEUE-specific test
+local/scripts/test-futex-pi-qemu.sh           # PI futex test
+local/scripts/test-futex-robust-qemu.sh       # Robust mutex test
+local/scripts/test-sched-rt-qemu.sh           # RT scheduling latency test
+local/scripts/test-sched-balance-qemu.sh      # Load balancing on multi-vCPU
+local/scripts/test-threading-baremetal.sh     # Bare metal multi-threaded stress
+```
+
+---
+
+## 7. Estimated Effort
+
+| Phase | Duration | New Code | Recovery | Dependencies |
+|-------|----------|----------|----------|-------------|
+| Phase 0: Patch Recovery | 1–2 weeks | Minimal (rebase 5 patches) | 13 patches apply directly | None |
+| Phase 1: Futex Completeness | 2–3 weeks | REQUEUE impl + WAKE_OP | PI/robust from P8 patches | Phase 0 |
+| Phase 2: SMP Scheduling | 3–4 weeks | TLB INVLPG + broadcast opt | Work stealing from P8 | Phase 0 |
+| Phase 3: RT Scheduling | 1–2 weeks | relibc sched_* API | RT dispatch from P5 | Phase 0 |
+| Phase 4: POSIX Pthread | 2–3 weeks | Affinity/naming/guard/clock | Partial from P7 patches | Phase 0, 3 |
+| Phase 5: IRQ + NUMA | 3–4 weeks | IRQ steering + NUMA allocator | NUMA topology from P9 | Phase 0, 2 |
+
+**Total:** 12–18 weeks with 1–2 developers if the phases run sequentially (the sum of the per-phase ranges above). Phase 0 alone recovers the majority of the value in 1–2 weeks.
+
+---
+
+## 8. Integration with Existing Plans
+
+| Plan | Relationship |
+|------|-------------|
+| `CONSOLE-TO-KDE-DESKTOP-PLAN.md` | **Consumer** — its Phase 3 (KWin) needs PI futex + RT scheduling; its Phase 2 (compositor) needs work stealing |
+| `IRQ-AND-LOWLEVEL-CONTROLLERS-ENHANCEMENT-PLAN.md` | **Sibling** — IRQ steering (Phase 5.1) belongs to both plans |
+| `DRM-MODERNIZATION-EXECUTION-PLAN.md` | **Consumer** — GPU worker threads benefit from load balancing + affinity |
+| `IMPLEMENTATION-MASTER-PLAN.md` | **Parent** — this plan covers the kernel threading substrate |
+| `CPU-DMA-IRQ-MSI-SCHEDULER-FIX-PLAN.md` | **Sibling** — overlaps on scheduler/IRQ delivery |
+
+---
+
+## 9. Bottom Line
+
+The Red Bear OS threading stack is **functional for basic single-threaded and lightly-threaded
+workloads**. The SMP boot, context switching, TLB shootdown, and basic futex operations are
+correct.
+
+The **critical problem** is that 6 months of threading enhancement work (P5–P9 patches) was
+lost during the local fork migration. This work survives as patch files: 13 apply directly to
+the current fork and 5 need a rebase, so **Phase 0 (Patch Recovery) is the single highest-ROI action**.
+
+After Phase 0, the remaining gaps are:
+1. **Futex REQUEUE/PI/robust** — for condvar performance and deadlock prevention
+2. **SMP work stealing + load balancing** — for multi-core utilization
+3. **RT scheduling** — for audio/compositor thread priority
+4. **POSIX pthread completeness** — for application compatibility
+5. **IRQ steering + NUMA** — for multi-socket performance
+
+The **desktop-critical path** (KWin responsiveness) requires Phases 0–3. The
+**server-critical path** (multi-socket, NUMA) adds Phase 5. Phase 4 (POSIX completeness)
+benefits all paths but is not desktop-blocking.
diff --git a/local/recipes/archives/uutils-tar/source b/local/recipes/archives/uutils-tar/source
deleted file mode 160000
index e4c2affa98..0000000000
--- a/local/recipes/archives/uutils-tar/source
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit e4c2affa98175249af3789f13737a3f1e58c1917
diff --git a/local/recipes/dev/ninja-build/source b/local/recipes/dev/ninja-build/source
deleted file mode 160000
index d829f42b8d..0000000000
--- a/local/recipes/dev/ninja-build/source
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit d829f42b8dcf6d2114b23e0c195eb395254a21ca
diff --git a/local/recipes/kde/sddm/source b/local/recipes/kde/sddm/source
deleted file mode 160000
index 63780fcd79..0000000000
--- a/local/recipes/kde/sddm/source
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit 63780fcd79f1dbf81a30eef48c28c699ab15aded
diff --git a/local/recipes/kde/sddm/source-pristine b/local/recipes/kde/sddm/source-pristine
deleted file mode 160000
index 63780fcd79..0000000000
--- a/local/recipes/kde/sddm/source-pristine
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit 63780fcd79f1dbf81a30eef48c28c699ab15aded
diff --git a/local/reference/linux-7.1 b/local/reference/linux-7.1
deleted file mode 160000
index ab9de95c9c..0000000000
--- a/local/reference/linux-7.1
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit ab9de95c9cf952332ab79453b4b5d1bfca8e514f
diff --git a/local/reference/seL4 b/local/reference/seL4
deleted file mode 160000
index a0b4f2d25d..0000000000
--- a/local/reference/seL4
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit a0b4f2d25dc975f6a9198c081359c0e38e5614fb
diff --git a/local/sources/libredox b/local/sources/libredox
new file mode 160000
index 0000000000..d01da350c1
--- /dev/null
+++ b/local/sources/libredox
@@ -0,0 +1 @@
+Subproject commit d01da350c18c2ab0709923dac602b2264a6b4530