diff --git a/.gitignore b/.gitignore index b999dcbe3b..e2a4c2f4a0 100644 --- a/.gitignore +++ b/.gitignore @@ -13,8 +13,12 @@ # Nested recipe debris from prior build-system layouts (4.2GB+ of duplicates) recipes/recipes/ -# Fetched source trees in mainline recipes (not our code in local/) -# Matches recipes///source/ but NOT local/recipes/*/source/ +# Fetched source trees in mainline recipes AND in specific local/ build-cache +# recipes (those whose source/ is a transient working copy re-fetched by the +# build system from the recipe's `git` URL). The durable code for these is +# recipe.toml + local/patches/. — DO NOT add a blanket `local/recipes/**/source` +# rule here: ~150 Red Bear recipes have durable source code under +# `local/recipes//source/` (the fork model). recipes/**/source recipes/**/source.tmp recipes/**/source-new @@ -22,6 +26,10 @@ recipes/**/source-old recipes/**/source.tar recipes/**/source.tar.tmp recipes/**/source.pre-preservation-test/ +local/recipes/archives/uutils-tar/source +local/recipes/dev/ninja-build/source +local/recipes/kde/sddm/source +local/recipes/kde/sddm/source-pristine # Build artifacts — target/ dirs are everywhere target @@ -31,6 +39,12 @@ wget-log # Vendor source trees (fetched, not our code) **/amdgpu-source/ +# External reference trees (read-only consultation sources). The Linux +# reference tree (local/reference/linux-7.1) is currently kept locally +# but is gitignored by size; seL4 reference is an empty placeholder. 
+local/reference/linux-*/ +local/reference/seL4/ + # Compiled objects *.o *.so diff --git a/local/docs/MULTITHREADING-COMPREHENSIVE-ASSESSMENT-AND-PLAN.md b/local/docs/MULTITHREADING-COMPREHENSIVE-ASSESSMENT-AND-PLAN.md new file mode 100644 index 0000000000..fe2908d4f8 --- /dev/null +++ b/local/docs/MULTITHREADING-COMPREHENSIVE-ASSESSMENT-AND-PLAN.md @@ -0,0 +1,720 @@ +# Red Bear OS — Multi-Threading Comprehensive Assessment and Implementation Plan + +**Date:** 2026-07-02 +**Scope:** Full-stack multi-threading audit: hardware/SMP, kernel scheduler, kernel futex, kernel syscall ABI, relibc pthreads, userspace threading correctness and performance +**Status:** Authoritative — supersedes `archived/KERNEL-SCHEDULER-MULTITHREAD-IMPROVEMENT-PLAN.md` and `archived/SCHEDULER-REVIEW-FINAL.md` for all threading matters +**Validation levels:** `builds` → `enumerates` → `usable` → `validated` → `hardware-validated` + +--- + +## 1. Executive Summary + +### The Critical Finding — Lost Threading Work + +The P5–P9 scheduler and futex enhancement work (documented as "complete" in the archived +plans) was **lost during the local fork migration** (2026-06). The local forks at +`local/sources/kernel/` and `local/sources/relibc/` were created from **upstream Redox +baselines** that did NOT include the Red Bear enhancement patches. The patches exist in +`local/patches/kernel/` and `local/patches/relibc/` but are **not wired into the recipes** +(both `recipe.toml` files use `path = "..."` with no `patches = [...]` list). 
+ +**Impact:** The running kernel has: +- Baseline DWRR scheduler only (no per-CPU queues, no work stealing, no load balancing, no vruntime, no RT scheduling, no cache-affine scheduling) +- Baseline futex only (WAIT/WAIT64/WAKE — no sharding, no PI, no REQUEUE, no robust list, no WAKE_OP, no BITSET) +- relibc `sched_*` functions are all `todo!()`, `pthread_setschedparam` is a no-op, robust mutexes are `todo_skip!`, PI is absent + +**Recovery:** 13 of 18 kernel P5–P9 patches apply cleanly to the current fork. 5 fail due to +patch-chain dependencies (they expect earlier patches applied first). The bulk of the work is +recoverable by re-applying patches to the forks and committing them. + +### What Actually Works Today + +| Layer | Status | Detail | +|-------|--------|--------| +| **SMP boot** | ✅ Solid | INIT→SIPI sequence correct, per-CPU PCR via GS_BASE, x2APIC support | +| **Context switching** | ✅ Solid | FPU/SIMD/AVX state save via XSAVE, FSBASE/GSBASE swap (FSGSBASE or MSR), correct callee-saved register save | +| **TLB shootdown protocol** | ✅ Correct | AtomicBool flag + IPI + ack counter with `fence(SeqCst)` race prevention | +| **Basic thread lifecycle** | ✅ Functional | pthread_create/join/detach/exit through proc scheme + redox_rt clone | +| **Basic synchronization** | ✅ Functional | Futex-backed mutex, condvar, rwlock, barrier, spinlock, once | +| **TLS** | ✅ Functional | ELF PT_TLS + pthread_key_create/getspecific/setspecific | +| **Per-CPU data** | ✅ Functional | PercpuBlock via GS_BASE, all per-CPU state accessible | +| **Signal delivery** | ✅ Functional | Shared-memory Sigcontrol pages, per-thread masks, trampoline | +| **Scheduler algorithm** | 🚧 Basic DWRR | 40 priority levels, geometric weights, cooperative preemption (3-tick quantum) | +| **Futex operations** | 🚧 Basic only | WAIT/WAIT64/WAKE with single global mutex | +| **SMP load balancing** | ❌ Missing | No work stealing, no migration, contexts stuck on birth CPU | +| **RT scheduling** | ❌ Missing | No
SCHED_FIFO/SCHED_RR, no kernel policy dispatch | +| **Futex REQUEUE** | ❌ Missing | Condvar broadcast causes thundering herd | +| **Robust mutexes** | ❌ Missing | Thread death while holding mutex → permanent deadlock | +| **PI futexes** | ❌ Missing | No priority inheritance → priority inversion risk | +| **CPU affinity API** | ❌ Missing from relibc | Kernel supports sched_affinity field but no userspace API | +| **Thread naming** | ❌ Missing from relibc | Kernel supports name field but no userspace API | +| **Per-page TLB flush** | ❌ Missing | `invalidate_all()` = full CR3 reload on every shootdown | +| **NUMA awareness** | ❌ Missing | No SRAT/SLIT, no proximity domains, flat memory model | +| **IRQ balancing** | ❌ Missing | All legacy IRQs hardwired to BSP | + +--- + +## 2. Layer-by-Layer Assessment + +### 2.1 Hardware / SMP Layer + +**Files:** `src/acpi/madt/arch/x86.rs`, `src/arch/x86_shared/start.rs`, +`src/arch/x86_shared/device/local_apic.rs`, `src/arch/x86_shared/device/ioapic.rs`, +`src/arch/x86_shared/ipi.rs`, `src/arch/x86_shared/interrupt/ipi.rs`, `src/percpu.rs`, +`src/arch/x86_shared/gdt.rs` + +**Verdict: Functional foundation, performance gaps.** + +| Component | Status | Detail | +|-----------|--------|--------| +| AP boot (INIT/SIPI) | ✅ validated | Correct trampoline at 0x8000, per-AP PCR/IDT/stack allocation | +| x2APIC mode | ✅ builds | Detected via CPUID, MSR-based access, APIC ID detection | +| Per-CPU PCR via GS_BASE | ✅ validated | `PercpuBlock::current()` reads from PCR, SWAPGS protocol correct | +| IPI send/receive | ✅ functional | 5 IPI kinds (Wakeup/Tlb/Switch/Pit/Profile), broadcast + unicast | +| TLB shootdown | ✅ correct | AtomicBool + IPI + ack with `fence(SeqCst)` race prevention | +| TLB granularity | ❌ coarse | Full CR3 reload (`mov cr3, cr3`) on every shootdown — no INVLPG | +| TLB broadcast | 🚧 sequential | Iterates CPUs individually, doesn't use ICR "all excluding self" shorthand | +| IRQ routing | ❌ BSP-only | Legacy I/O APIC 
entries hardcode `dest: bsp_apic_id` | +| NUMA | ❌ absent | No SRAT/SLIT, no proximity domains | +| SMT/HT topology | ❌ absent | No cache hierarchy, no hyperthread awareness | +| Idle loop | ✅ functional | MWAIT with deepest C-state or HLT fallback | +| W^X for trampoline | 🚧 minor | Trampoline page briefly W+X, unmapped after AP boot | + +### 2.2 Kernel Scheduler Layer + +**Files:** `src/context/switch.rs`, `src/context/mod.rs`, `src/context/context.rs`, +`src/context/timeout.rs` + +**Verdict: Correct but primitive — DWRR only, no SMP balancing, no RT classes.** + +**Algorithm:** Deficit Weighted Round Robin (DWRR) +- 40 priority levels, each a `VecDeque` +- Geometric weights: `SCHED_PRIO_TO_WEIGHT[i] ≈ 1.25^i` (88761 → 15) +- Per-CPU `balance` accumulator drives dequeue decisions +- Quantum: 3 PIT ticks (~12.2ms) per scheduling round +- Cooperative preemption: `preempt_locks > 0` disables preemption + +**Global locks:** +- `RUN_CONTEXTS: Mutex` — all 40 priority queues under one L1 lock +- `IDLE_CONTEXTS: Mutex>` — sleeping contexts +- `CONTEXT_SWITCH_LOCK: AtomicBool` — global CAS spinlock serializing all context switches + +**What's missing (all was in lost P5–P9 work):** + +| Gap | Lost Patch | Recoverable? 
| +|-----|-----------|-------------| +| Per-CPU run queues (eliminate global L1) | P6-percpu-runqueues, P8-percpu-sched, P8-percpu-wiring | ✅ applies cleanly | +| Work stealing | P8-work-stealing | ❌ needs rebase (depends on per-CPU wiring) | +| Initial placement (least-loaded CPU) | P8-initial-placement | ✅ applies cleanly | +| Load balancing | P8-load-balance (absorbed) | needs verification | +| Vruntime tracking + min-vruntime selection | P6-vruntime-switch | ✅ applies cleanly | +| SchedPolicy enum (FIFO/RR/Other) | P5-sched-rt-policy | ✅ applies cleanly | +| RT scheduling dispatch | P5-sched-rt-policy | ✅ applies cleanly | +| Cache-affine scheduling | P7-cache-affine-switch | ✅ applies cleanly | +| NUMA topology hints | P9-numa-topology | ✅ applies cleanly | + +### 2.3 Kernel Futex Layer + +**File:** `src/syscall/futex.rs` + +**Verdict: Baseline only — critical operations missing for desktop workloads.** + +| Operation | Status | Impact of Absence | +|-----------|--------|-------------------| +| `FUTEX_WAIT` (32-bit) | ✅ | — | +| `FUTEX_WAIT64` (64-bit) | ✅ | — | +| `FUTEX_WAKE` | ✅ | — | +| `FUTEX_REQUEUE` | ❌ returns EINVAL | `pthread_cond_broadcast` wakes ALL waiters (thundering herd) | +| `FUTEX_CMP_REQUEUE` | ❌ not defined | Same + atomicity gap | +| `FUTEX_WAKE_OP` | ❌ not defined | glibc mutex fast path unavailable | +| `FUTEX_WAIT_BITSET` | ❌ not defined | `pselect`/`ppoll` optimization unavailable | +| `FUTEX_WAKE_BITSET` | ❌ not defined | Targeted wake unavailable | +| `FUTEX_LOCK_PI` / `UNLOCK_PI` | ❌ not defined | Priority inversion unprotected | +| Robust futex list | ❌ not defined | Thread death → permanent deadlock | +| Futex sharding (per-futex lock) | ❌ single global L1 mutex | All futex ops on all CPUs contend on one lock | +| Process-private futexes | ❌ global table | Unnecessary cross-process visibility | + +**Architecture:** +``` +static FUTEXES: Mutex // single global lock +type FutexList = HashMap> +``` + +Physical address is the key 
(enables cross-address-space futex via MAP_SHARED). +Virtual address + Weak used for CoW disambiguation. + +**Recoverable work (lost patches):** + +| Feature | Lost Patch | Applies? | +|---------|-----------|----------| +| 64-shard hash table | P6-futex-sharding | ✅ cleanly | +| FUTEX_REQUEUE + CMP_REQUEUE | P8-futex-requeue | ❌ needs rebase | +| PI futex (LOCK_PI/UNLOCK_PI/TRYLOCK_PI) | P8-futex-pi | ❌ needs rebase | +| PI CAS fix | P9-futex-pi-cas-fix | ❌ needs rebase | +| Robust futex list | P8-futex-robust | ❌ needs rebase | + +The 4 failing patches likely fail because they depend on sharding (P6-futex-sharding) being +applied first. Apply in order: P6-sharding → P8-requeue → P8-pi → P8-robust → P9-pi-cas-fix. + +### 2.4 Kernel Syscall ABI Layer + +**Files:** `src/syscall/mod.rs`, `src/syscall/futex.rs`, `src/syscall/time.rs`, +`src/syscall/process.rs`, `local/sources/syscall/src/number.rs`, `src/scheme/proc.rs` + +**Verdict: Minimal surface — most threading done via proc scheme, not syscalls.** + +The kernel defines only ~35 syscall numbers. 
Threading-relevant ones: + +| Syscall | Status | Notes | +|---------|--------|-------| +| `SYS_FUTEX` (240) | ✅ partial | WAIT/WAIT64/WAKE only | +| `SYS_YIELD` (158) | ✅ | `context::switch()` + signal handler | +| `SYS_FMAP` (900) | ✅ | Anonymous + file-backed mmap | +| `SYS_FUNMAP` (92) | ✅ | munmap | +| `SYS_MPROTECT` (125) | ✅ | | +| `SYS_MREMAP` (155) | ✅ | | +| `SYS_NANOSLEEP` (162) | ✅ | EINTR-aware | +| `SYS_CLOCK_GETTIME` (265) | ✅ partial | REALTIME + MONOTONIC only | + +**Threading done via proc scheme (not syscalls):** + +| Operation | Mechanism | +|-----------|-----------| +| Thread/process creation | `proc:` scheme: open "new-context", share addr_space + files via kdup | +| waitpid | `proc:` scheme: `EVENT_READ` on context fd | +| getpid/gettid | `proc:` scheme: read "attrs" handle | +| kill/tkill | `proc:` scheme: `ForceKill` / `Interrupt` ContextVerb | +| CPU affinity | `proc:` scheme: write "sched-affinity" handle | +| Priority | `proc:` scheme: write "attrs" prio field | +| Signal setup | `proc:` scheme: write "sighandler" + shared Sigcontrol pages | +| TLS base (FSBASE) | `proc:` scheme: write "regs/env" EnvRegisters | + +**Completely missing syscalls (no number, no handler):** +`clone`, `fork`, `vfork`, `waitpid`, `wait4`, `kill`, `tkill`, `tgkill`, `arch_prctl`, +`set_thread_area`, `set_tid_address`, `set_robust_list`, `get_robust_list`, +`sched_setaffinity`, `sched_getaffinity`, `sched_setscheduler`, `sched_getparam`, +`sigaction`, `sigprocmask`, `sigpending`, `sigsuspend`, `sigtimedwait`, +`timer_create`, `timer_settime`, `timer_delete`, `timerfd_create`, +`getrusage`, `setrlimit`, `getrlimit`, `times` + +### 2.5 relibc Pthread Layer + +**Files:** `src/pthread/mod.rs`, `src/sync/*.rs`, `src/header/pthread/*.rs`, +`src/header/sched/mod.rs`, `src/ld_so/tcb.rs`, `src/platform/redox/mod.rs` + +**Verdict: Core pthreads solid, scheduling/robust/PI absent, several POSIX gaps.** + +#### Fully Working (futex-backed) + +| API Group | Backend | Notes | 
+|-----------|---------|-------| +| `pthread_create/join/detach/exit` | redox_rt clone + Waitval | Stack via mmap, TLS via Tcb::new() | +| `pthread_cancel/setcancelstate/testcancel` | SIGRT_RLCT_CANCEL (33) | Deferred cancellation only | +| `pthread_mutex_*` (normal/recursive/errorcheck) | AtomicU32 CAS + futex_wait/wake | 3-state: unlocked/locked/waiters | +| `pthread_cond_*` | Two-counter futex design | CLOCK_REALTIME only (monotonic = stub) | +| `pthread_rwlock_*` | AtomicU32 + futex | Reader count + WAITING_WR bit | +| `pthread_barrier_*` | Mutex + Cond | gen_id wrapping counter | +| `pthread_spin_*` | AtomicI32 CAS | No futex, pure spinning | +| `pthread_once` | 3-state futex (UNINIT→INITING→INIT) | | +| `pthread_key_create/getspecific/setspecific/delete` | BTreeMap global + thread_local values | Destructor iteration per POSIX | +| `pthread_sigmask` | Delegates to sigprocmask | | +| `pthread_kill` | redox_rt::rlct_kill | | +| `pthread_atfork` | Thread-local LinkedList hooks | | +| ELF TLS (`__thread` / `#[thread_local]`) | PT_TLS + Tcb | Static + dynamic DTV for dlopen | +| `pthread_attr_*` (getters/setters) | RlctAttr struct | | + +#### Stubs / No-ops / Missing + +| API | Status | Root Cause | +|-----|--------|------------| +| `sched_get_priority_max/min` | `todo!()` | Kernel has no scheduling policy API | +| `sched_getparam/setparam` | `todo!()` | Same | +| `sched_setscheduler` | `todo!()` | Same | +| `sched_rr_get_interval` | `todo!()` | Same | +| `pthread_setschedparam` | No-op (returns Ok) | Kernel ignores policy | +| `pthread_setschedprio` | No-op (returns Ok) | Kernel ignores priority change | +| `pthread_getschedparam` | `todo!()` | | +| `pthread_getcpuclockid` | ENOENT | No per-thread CPU clock | +| `pthread_mutex_consistent` | `todo_skip!` | Robust mutex not implemented | +| `pthread_mutex_getprioceiling` | `todo_skip!` | Priority ceiling not implemented | +| `pthread_mutex_setprioceiling` | `todo_skip!` | Same | +| `pthread_mutexattr_setprotocol` 
(PRIO_INHERIT) | Accepted, no-op | PI futex missing | +| `pthread_mutexattr_setrobust` (ROBUST) | Accepted, no-op | Robust futex missing | +| `pthread_cond_init` CLOCK_MONOTONIC | `todo_skip!` | | +| `pthread_cond_signal` | Calls broadcast (wakes ALL) | Missing FUTEX_REQUEUE optimization | +| `pthread_setaffinity_np` | Not defined | | +| `pthread_getaffinity_np` | Not defined | | +| `pthread_setname_np` | Not defined | | +| `pthread_getname_np` | Not defined | | +| `pthread_setcanceltype` | Always returns DEFERRED | ASYNC not tracked | +| Guard pages | Attribute stored, not mapped | No PROT_NONE page before stack | +| PTHREAD_KEYS_MAX limit | Not checked | | + +--- + +## 3. Gap Classification + +### 3.1 Correctness Gaps (Must Fix — Silent Data Corruption or Deadlock) + +| # | Gap | Impact | Fix Location | +|---|-----|--------|-------------| +| C1 | **No robust mutexes** | Thread death while holding mutex → permanent deadlock for all waiters | Kernel: robust futex list + relibc: pthread_mutex_consistent | +| C2 | **No PI futexes** | Priority inversion: low-prio thread blocks high-prio thread indefinitely | Kernel: FUTEX_LOCK_PI/UNLOCK_PI + relibc: mutexattr_setprotocol | +| C3 | **`pthread_cond_signal` wakes ALL** | Correctness: wastes CPU. 
Performance: thundering herd on every signal | relibc: use true wake(1) — may need FUTEX_REQUEUE | +| C4 | **`fork()` not thread-safe** | `pthread_atfork` hooks exist but child inherits locked mutexes | relibc: implement atfork child handlers properly | + +### 3.2 Performance Gaps (Must Fix for Desktop Responsiveness) + +| # | Gap | Impact | Fix Location | +|---|-----|--------|-------------| +| P1 | **No SMP load balancing** | Cores sit idle while others are overloaded | Kernel: work stealing + initial placement | +| P2 | **No futex sharding** | Single global L1 mutex for ALL futex ops on ALL CPUs | Kernel: 64-shard hash table | +| P3 | **No FUTEX_REQUEUE** | `pthread_cond_broadcast` wakes all → thundering herd | Kernel: REQUEUE + CMP_REQUEUE | +| P4 | **Full TLB flush on every shootdown** | Per-page mprotect/munmap flushes entire TLB on all cores | Kernel: INVLPG-based selective flush | +| P5 | **Global context switch lock** | Serialization bottleneck beyond ~8 cores | Kernel: per-CPU context switch (needs per-CPU run queues) | +| P6 | **All IRQs to BSP** | CPU 0 handles all interrupts, cache thrash, latency | Kernel: IRQ steering in I/O APIC + MSI/MSI-X dest field | +| P7 | **No RT scheduling** | Audio/compositor threads can't get priority | Kernel: SchedPolicy + RT dispatch + relibc: sched_setscheduler | + +### 3.3 POSIX Completeness Gaps (Must Fix for Application Compatibility) + +| # | Gap | Impact | Fix Location | +|---|-----|--------|-------------| +| X1 | `sched_*` all `todo!()` | Applications calling sched_setscheduler panic | relibc: implement via proc scheme | +| X2 | `pthread_setschedparam` no-op | Apps can't change thread priority | relibc: wire to proc scheme prio write | +| X3 | `pthread_setaffinity_np` missing | Apps can't pin threads to CPUs | relibc: implement via proc scheme affinity write | +| X4 | `pthread_setname_np` missing | Debugging harder (no thread names in /proc) | relibc: implement via proc scheme name write | +| X5 | 
`pthread_getcpuclockid` ENOENT | Per-thread profiling impossible | relibc + kernel: expose cpu_time via clock | +| X6 | Guard pages not mapped | Stack overflow → silent corruption, no SIGSEGV | relibc: mmap PROT_NONE guard page in pthread_create | +| X7 | `pthread_cond_init` monotonic stub | CLOCK_MONOTONIC condvars use REALTIME (affected by wall clock jumps) | relibc: implement monotonic condvar | + +--- + +## 4. Implementation Plan + +### Phase 0: Patch Recovery — Re-Apply Lost Threading Work (Week 1–2) + +**Goal:** Recover the P5–P9 work that was lost during the local fork migration. + +**This is the highest-priority phase — it restores ~6 months of work with minimal new code.** + +#### 0.1 — Re-apply kernel scheduler patches to local fork + +Apply in dependency order to `local/sources/kernel/`: + +| Order | Patch | Status | Action | +|-------|-------|--------|--------| +| 1 | P6-futex-sharding | ✅ applies | Commit directly | +| 2 | P6-percpu-runqueues | ✅ applies | Commit directly | +| 3 | P8-percpu-sched | ✅ applies | Commit directly | +| 4 | P8-percpu-wiring | ✅ applies | Commit directly | +| 5 | P8-initial-placement | ✅ applies | Commit directly | +| 6 | P5-sched-rt-policy | ✅ applies | Commit directly | +| 7 | P5-context-mod-sched | ✅ applies | Commit directly | +| 8 | P6-vruntime-switch | ✅ applies | Commit directly | +| 9 | P7-cache-affine-switch | ✅ applies | Commit directly | +| 10 | P9-numa-topology | ✅ applies | Commit directly | +| 11 | P9-proc-lock-ordering | ✅ applies | Commit directly | +| 12 | P8-work-stealing | ❌ needs rebase | Rebase against 1–11, then apply | +| 13 | P8-futex-requeue | ❌ needs rebase | Rebase against P6-sharding (#1), then apply | +| 14 | P8-futex-pi | ❌ needs rebase | Rebase against #13, then apply | +| 15 | P8-futex-robust | ❌ needs rebase | Rebase against #14, then apply | +| 16 | P9-futex-pi-cas-fix | ❌ needs rebase | Rebase against #14, then apply | +| 17 | P7-scheduler-improvements | ❌ needs rebase | Rebase against 1–11, 
then apply | + +**Verification after each patch:** +```bash +cd local/sources/kernel +cargo check # must pass +``` + +#### 0.2 — Re-apply relibc threading patches to local fork + +Apply to `local/sources/relibc/`: + +| Patch | Action | +|-------|--------| +| P3-threads.patch | ✅ applies — commit | +| P3-barrier-smp-futex (from absorbed/) | Verify already in fork; if not, apply | +| P3-pthread-signal-races (from absorbed/) | Verify already in fork | +| P3-pthread-yield (from absorbed/) | Verify already in fork | +| P5-robust-mutexes (from absorbed/) | Verify; re-apply if missing | +| P5-robust-mutex-enotrec-fix (from absorbed/) | Same | +| P5-sched-api (from absorbed/) | Same | +| P7-pthread-affinity (from absorbed/) | Same | +| P7-pthread-setname (from absorbed/) | Same | +| P7-setpriority (from absorbed/) | Same | +| P9-spin-and-barrier (from absorbed/) | Same | +| P9-spin-fix (from absorbed/) | Same | +| P3-semaphore-comprehensive | ✅ applies | + +**Verification:** +```bash +cd local/sources/relibc +make all # must pass +touch relibc && make prefix # rebuild prefix with new libc +``` + +#### 0.3 — Build and smoke test + +```bash +export REDBEAR_ALLOW_PROTECTED_FETCH=1 +./local/scripts/build-redbear.sh --upstream redbear-mini +make qemu # verify boot + basic operation +``` + +**Success criteria:** redbear-mini boots, multi-threaded daemons (pcid, xhcid) start, no kernel panic. + +--- + +### Phase 1: Futex Completeness (Week 2–4) + +**Goal:** Close the futex operation gaps that affect correctness and performance. + +**Depends on:** Phase 0 complete (sharding applied first). 
+ +#### 1.1 — FUTEX_REQUEUE + FUTEX_CMP_REQUEUE + +**Kernel:** `src/syscall/futex.rs` +- Add `FUTEX_REQUEUE` and `FUTEX_CMP_REQUEUE` to the futex dispatcher +- Implement: move up to `val` waiters from addr1 → addr2, optionally compare `*addr1 == val2` +- Requires locking TWO shards (acquire both in deterministic order to avoid deadlock) + +**relibc:** `src/sync/cond.rs` +- Change `pthread_cond_broadcast` to use `FUTEX_REQUEUE` (move waiters from condvar futex to mutex futex) +- Change `pthread_cond_signal` to wake exactly 1 (not all) + +**Impact:** Eliminates thundering herd on every `pthread_cond_broadcast`. Major win for Qt event loop, KWin compositor, Mesa worker threads. + +#### 1.2 — PI Futexes (FUTEX_LOCK_PI / FUTEX_UNLOCK_PI / FUTEX_TRYLOCK_PI) + +**Kernel:** `src/syscall/futex.rs` +- Add `PiState` tracking per futex: owner context + waiter list with priorities +- On `LOCK_PI` block: boost owner's priority to waiter's priority +- On `UNLOCK_PI`: restore original priority, wake highest-priority waiter +- Requires kernel RT scheduling (Phase 0.1 #6–7: P5-sched-rt-policy) + +**relibc:** `src/sync/pthread_mutex.rs` +- Implement `PTHREAD_PRIO_INHERIT` protocol path using PI futex +- Replace `todo_skip!` in `pthread_mutex_consistent` with real implementation + +#### 1.3 — Robust Futex List + +**Kernel:** `src/syscall/futex.rs` + `src/context/context.rs` +- Add `robust_list_head: Option` to `Context` struct +- Implement `set_robust_list` / `get_robust_list` via proc scheme or syscall +- On thread exit (`exit_this_context`): walk robust list, set `FUTEX_OWNER_DIED` bit, wake one waiter with `EOWNERDEAD` + +**relibc:** `src/sync/pthread_mutex.rs` +- Implement robust list registration in `pthread_mutex_lock` +- Implement `pthread_mutex_consistent`: clear `EOWNERDEAD` state +- Replace `todo_skip!` with real implementation + +#### 1.4 — FUTEX_WAKE_OP + +**Kernel:** `src/syscall/futex.rs` +- Implement atomic op + wake: perform op on addr2, then wake up to `val` waiters 
on addr1 +- Operations: set, add, or, andn, xor, with comparison condition + +**Impact:** glibc mutex fast path optimization. Not critical for relibc but helps ported glibc-linked binaries. + +--- + +### Phase 2: SMP Scheduling Quality (Week 3–6) + +**Goal:** Make multi-core actually distribute work. + +**Depends on:** Phase 0 complete (per-CPU queues applied). + +#### 2.1 — Work stealing (recover + fix) + +**Kernel:** `src/context/switch.rs` +- On `select_next_context()` empty local queue: steal from victim CPU +- Pick victim by round-robin, steal highest-priority runnable context +- Limit steal batch size (1–2 contexts per steal attempt) +- Send `IpiKind::Wakeup` to target CPU if stealing woke it from idle + +**Recovery:** P8-work-stealing needs rebase against per-CPU wiring. + +#### 2.2 — Load balancing (recover + verify) + +**Kernel:** `src/context/switch.rs` +- Periodic balance trigger (every N ticks or when queue depth difference > threshold) +- Migrate contexts from overloaded CPU to most-idle CPU +- Respect `sched_affinity` mask during migration + +**Recovery:** P8-load-balance is in absorbed/ — verify it's in the fork after Phase 0. + +#### 2.3 — Reschedule IPI + +**Kernel:** `src/arch/x86_shared/ipi.rs` + `src/context/switch.rs` +- When waking a context on a different CPU, send `IpiKind::Switch` to that CPU +- Currently the Switch IPI exists but is not used by the scheduler + +#### 2.4 — Per-page TLB flush (INVLPG) + +**Kernel:** `rmm/src/arch/x86_64.rs` + `src/context/memory.rs` +- Add `invalidate_page(addr)` using `invlpg` instruction +- Modify `Flusher` to track individual pages and use INVLPG when ≤ N pages affected +- Fall back to CR3 reload only for large-scale invalidations + +**Impact:** Every `mprotect`/`mmap`/`munmap` on a multi-threaded process currently flushes the ENTIRE TLB on every core. This is one of the most impactful single fixes. 
+ +#### 2.5 — TLB broadcast optimization + +**Kernel:** `src/percpu.rs` +- Replace per-CPU sequential `shootdown_tlb_ipi(Some(id))` loop with ICR "all excluding self" (destination shorthand 0b11) +- Single IPI + global ack counter instead of N individual IPIs + N ack counters + +--- + +### Phase 3: RT Scheduling (Week 4–6) + +**Goal:** Allow applications to request real-time scheduling for latency-sensitive threads. + +**Depends on:** Phase 0 (SchedPolicy applied) + Phase 2 (per-CPU queues). + +#### 3.1 — Kernel RT scheduling dispatch + +**Kernel:** `src/context/switch.rs` (from P5-sched-rt-policy — recovered in Phase 0) +- `select_next_context()` passes: + 1. SCHED_FIFO contexts (highest RT priority first, no preemption within same prio) + 2. SCHED_RR contexts (highest RT priority first, round-robin within same prio) + 3. SCHED_OTHER contexts (existing DWRR/vruntime) +- SCHED_RR quantum: configurable per-context (default 100ms) + +#### 3.2 — relibc sched_* API completion + +**relibc:** `src/header/sched/mod.rs` + +Replace ALL `todo!()` stubs: + +| Function | Implementation | +|----------|---------------| +| `sched_getscheduler(pid)` | Read policy from proc scheme attrs | +| `sched_setscheduler(pid, policy, param)` | Write policy + RT priority via proc scheme | +| `sched_getparam(pid, param)` | Read RT priority from proc scheme | +| `sched_setparam(pid, param)` | Write RT priority via proc scheme | +| `sched_get_priority_max(policy)` | Return 99 for FIFO/RR, 0 for OTHER | +| `sched_get_priority_min(policy)` | Return 1 for FIFO/RR, 0 for OTHER | +| `sched_rr_get_interval(pid, tp)` | Return SCHED_RR quantum (100ms default) | + +#### 3.3 — pthread_setschedparam wiring + +**relibc:** `src/pthread/mod.rs` +- Replace `set_sched_param` no-op with real proc scheme call +- Replace `set_sched_priority` no-op with real proc scheme call + +--- + +### Phase 4: POSIX Pthread Completeness (Week 5–8) + +**Goal:** Close remaining POSIX gaps that block application compatibility. 
+ +**Depends on:** Phase 0 + Phase 3 (for sched API). + +#### 4.1 — pthread_setaffinity_np / pthread_getaffinity_np + +**relibc:** `src/header/pthread/mod.rs` + `src/header/sched/mod.rs` +- Implement using proc scheme "sched-affinity" write/read +- Define `cpu_set_t` type and `CPU_SET/CPU_CLR/CPU_ZERO/CPU_ISSET` macros + +#### 4.2 — pthread_setname_np / pthread_getname_np + +**relibc:** `src/header/pthread/mod.rs` +- Implement using proc scheme name write/read (kernel already supports 32-char name field) + +#### 4.3 — pthread_cond_init CLOCK_MONOTONIC + +**relibc:** `src/sync/cond.rs` +- Replace `todo_skip!` with real monotonic clock support +- Store clock choice in cond struct, use `CLOCK_MONOTONIC` for deadline calculations + +#### 4.4 — Guard pages + +**relibc:** `src/pthread/mod.rs` +- In `pthread_create`, when allocating stack via mmap: + - Map `[stack_base, stack_base + guard_size)` with `PROT_NONE` + - Map `[stack_base + guard_size, stack_base + guard_size + stack_size)` with `PROT_READ | PROT_WRITE` +- On thread exit, munmap both regions + +#### 4.5 — pthread_getcpuclockid + +**relibc:** `src/header/pthread/mod.rs` +- Return `CLOCK_THREAD_CPUTIME_ID` (requires kernel support — add clock to `clock_gettime`) + +**Kernel:** `src/syscall/time.rs` +- Add `CLOCK_THREAD_CPUTIME_ID` → read `context.cpu_time` + +#### 4.6 — PTHREAD_KEYS_MAX enforcement + +**relibc:** `src/header/pthread/tls.rs` +- Check `NEXTKEY` against `PTHREAD_KEYS_MAX` (1024) before allocating + +--- + +### Phase 5: IRQ Steering and NUMA (Week 8–12) + +**Goal:** Distribute interrupt load and respect memory locality. + +**Depends on:** Phase 2 (per-CPU infrastructure). 
+ +#### 5.1 — IRQ steering + +**Kernel:** `src/arch/x86_shared/device/ioapic.rs` + `src/arch/x86_shared/idt.rs` +- Change I/O APIC redirection `dest` from `bsp_apic_id` to round-robin or RSS hash +- Add per-CPU legacy IRQ handlers in IDT (not just BSP) +- For MSI/MSI-X: set destination CPU in Message Address register + +#### 5.2 — NUMA topology discovery + +**Kernel:** `src/acpi/` (from P9-numa-topology — recovered in Phase 0) +- Parse SRAT (System Resource Affinity Table) for proximity domains +- Parse SLIT (System Locality Distance Information Table) for inter-node distances +- Store `NumaTopology` in kernel for O(1) scheduling lookups + +#### 5.3 — NUMA-aware memory allocation + +**Kernel:** `src/memory/` + frame allocator +- Track frame NUMA node in `Frame` or `PageInfo` +- On allocation, prefer frames from requesting CPU's NUMA node +- Fall back to a remote node when the local node is exhausted + +--- + +## 5. Dependency Chain + +``` +Phase 0 (Patch Recovery) ← BLOCKING FOR ALL OTHERS + │ + ├──► Phase 1 (Futex Completeness) + │ │ + │ ├──► 1.1 REQUEUE ──► condvar performance + │ ├──► 1.2 PI ──► priority inversion fix (needs Phase 3.1) + │ ├──► 1.3 Robust ──► deadlock prevention + │ └──► 1.4 WAKE_OP ──► glibc compat + │ + ├──► Phase 2 (SMP Scheduling) + │ │ + │ ├──► 2.1 Work stealing ──► core utilization + │ ├──► 2.2 Load balancing ──► fair distribution + │ ├──► 2.3 Reschedule IPI ──► cross-CPU wakeup + │ ├──► 2.4 Per-page TLB ──► mmap/mprotect performance + │ └──► 2.5 TLB broadcast ──► IPI efficiency + │ + ├──► Phase 3 (RT Scheduling) + │ │ + │ ├──► 3.1 Kernel RT dispatch (from Phase 0) + │ ├──► 3.2 relibc sched_* API ──► POSIX compat + │ └──► 3.3 pthread_setschedparam ──► app priority control + │ + ├──► Phase 4 (POSIX Pthread Completeness) + │ │ + │ ├──► 4.1 Affinity API ──► CPU pinning + │ ├──► 4.2 Thread naming ──► debuggability + │ ├──► 4.3 Monotonic condvar ──► clock correctness + │ ├──► 4.4 Guard pages ──► stack overflow detection + │ ├──► 4.5 CPU clock ──►
per-thread profiling + │ └──► 4.6 Keys max ──► resource limit + │ + └──► Phase 5 (IRQ + NUMA) + │ + ├──► 5.1 IRQ steering ──► interrupt distribution + ├──► 5.2 NUMA topology ──► (from Phase 0) + └──► 5.3 NUMA allocator ──► memory locality +``` + +**Parallel work possible:** +- Phase 1 + Phase 2 + Phase 3 can run in parallel after Phase 0 +- Phase 4 items are independent of each other +- Phase 5 depends on Phase 2 but not on Phase 1/3/4 + +--- + +## 6. Validation Plan + +### 6.1 Build Evidence + +| Check | Command | +|-------|---------| +| Kernel compiles | `make r.kernel` | +| relibc compiles | `make r.relibc` | +| Prefix rebuilt | `touch relibc kernel && make prefix` | +| Full OS builds | `make all CONFIG_NAME=redbear-mini` | + +### 6.2 Runtime Evidence (QEMU) + +| Test | Verification | +|------|-------------| +| Multi-threaded boot | `make qemu QEMUFLAGS="-smp 4"` — all 4 CPUs active | +| pthread smoke test | Guest: compile + run simple pthread_create/join/mutex test | +| Work stealing | Guest: spawn 8 threads on 4-CPU QEMU, verify all CPUs utilized | +| Futex REQUEUE | Guest: condvar broadcast benchmark — waiters wake in ≤2 batches, not N | +| PI futex | Guest: priority inversion test — high-prio thread unblocked within 1 tick | +| Robust mutex | Guest: kill thread holding mutex, verify EOWNERDEAD recovery | +| RT scheduling | Guest: SCHED_FIFO thread preempts SCHED_OTHER within 100μs | +| CPU affinity | Guest: pin thread to CPU 1, verify it never runs on CPU 0 | +| Thread naming | Guest: `cat /scheme/proc/*/name` shows set names | +| Guard pages | Guest: overflow stack, verify SIGSEGV (not silent corruption) | +| TLB efficiency | Guest: mprotect benchmark — compare TLB miss rate before/after | + +### 6.3 Validation Scripts (to create) + +```bash +local/scripts/test-threading-qemu.sh # Comprehensive threading smoke test +local/scripts/test-futex-requeue-qemu.sh # REQUEUE-specific test +local/scripts/test-futex-pi-qemu.sh # PI futex test 
+local/scripts/test-futex-robust-qemu.sh # Robust mutex test +local/scripts/test-sched-rt-qemu.sh # RT scheduling latency test +local/scripts/test-sched-balance-qemu.sh # Load balancing on multi-vCPU +local/scripts/test-threading-baremetal.sh # Bare metal multi-threaded stress +``` + +--- + +## 7. Estimated Effort + +| Phase | Duration | New Code | Recovery | Dependencies | +|-------|----------|----------|----------|-------------| +| Phase 0: Patch Recovery | 1–2 weeks | Minimal (rebase 5 patches) | 13 patches apply directly | None | +| Phase 1: Futex Completeness | 2–3 weeks | REQUEUE impl + WAKE_OP | PI/robust from P8 patches | Phase 0 | +| Phase 2: SMP Scheduling | 3–4 weeks | TLB INVLPG + broadcast opt | Work stealing from P8 | Phase 0 | +| Phase 3: RT Scheduling | 1–2 weeks | relibc sched_* API | RT dispatch from P5 | Phase 0 | +| Phase 4: POSIX Pthread | 2–3 weeks | Affinity/naming/guard/clock | Partial from P7 patches | Phase 0, 3 | +| Phase 5: IRQ + NUMA | 3–4 weeks | IRQ steering + NUMA allocator | NUMA topology from P9 | Phase 0, 2 | + +**Total:** 12–18 weeks with 1–2 developers. Phase 0 alone recovers the majority of the value in 1–2 weeks. + +--- + +## 8. Integration with Existing Plans + +| Plan | Relationship | +|------|-------------| +| `CONSOLE-TO-KDE-DESKTOP-PLAN.md` | **Consumer** — Phase 3 (KWin) needs PI futex + RT scheduling; Phase 2 (compositor) needs work stealing | +| `IRQ-AND-LOWLEVEL-CONTROLLERS-ENHANCEMENT-PLAN.md` | **Sibling** — IRQ steering (Phase 5.1) belongs to both plans | +| `DRM-MODERNIZATION-EXECUTION-PLAN.md` | **Consumer** — GPU worker threads benefit from load balancing + affinity | +| `IMPLEMENTATION-MASTER-PLAN.md` | **Parent** — this plan covers the kernel threading substrate | +| `CPU-DMA-IRQ-MSI-SCHEDULER-FIX-PLAN.md` | **Sibling** — overlaps on scheduler/IRQ delivery | + +--- + +## 9. Bottom Line + +The Red Bear OS threading stack is **functional for basic single-threaded and lightly-threaded +workloads**. 
The SMP boot, context switching, TLB shootdown, and basic futex operations are +correct. + +The **critical problem** is that 6 months of threading enhancement work (P5–P9 patches) was +lost during the local fork migration. This work still exists as patch files, most of which +apply to the current fork as-is (13 apply directly; 5 need a rebase, per Section 7) — +**Phase 0 (Patch Recovery) is the single highest-ROI action**. + +After Phase 0, the remaining gaps are: +1. **Futex REQUEUE/PI/robust** — for condvar performance and deadlock prevention +2. **SMP work stealing + load balancing** — for multi-core utilization +3. **RT scheduling** — for audio/compositor thread priority +4. **POSIX pthread completeness** — for application compatibility +5. **IRQ steering + NUMA** — for multi-socket performance + +The **desktop-critical path** (KWin responsiveness) requires Phases 0–3. The +**server-critical path** (multi-socket, NUMA) adds Phase 5. Phase 4 (POSIX completeness) +benefits all paths but is not desktop-blocking. diff --git a/local/recipes/archives/uutils-tar/source b/local/recipes/archives/uutils-tar/source deleted file mode 160000 index e4c2affa98..0000000000 --- a/local/recipes/archives/uutils-tar/source +++ /dev/null @@ -1 +0,0 @@ -Subproject commit e4c2affa98175249af3789f13737a3f1e58c1917 diff --git a/local/recipes/dev/ninja-build/source b/local/recipes/dev/ninja-build/source deleted file mode 160000 index d829f42b8d..0000000000 --- a/local/recipes/dev/ninja-build/source +++ /dev/null @@ -1 +0,0 @@ -Subproject commit d829f42b8dcf6d2114b23e0c195eb395254a21ca diff --git a/local/recipes/kde/sddm/source b/local/recipes/kde/sddm/source deleted file mode 160000 index 63780fcd79..0000000000 --- a/local/recipes/kde/sddm/source +++ /dev/null @@ -1 +0,0 @@ -Subproject commit 63780fcd79f1dbf81a30eef48c28c699ab15aded diff --git a/local/recipes/kde/sddm/source-pristine b/local/recipes/kde/sddm/source-pristine deleted file mode 160000 index 63780fcd79..0000000000 --- a/local/recipes/kde/sddm/source-pristine +++ /dev/null @@ -1 +0,0 @@ 
-Subproject commit 63780fcd79f1dbf81a30eef48c28c699ab15aded diff --git a/local/reference/linux-7.1 b/local/reference/linux-7.1 deleted file mode 160000 index ab9de95c9c..0000000000 --- a/local/reference/linux-7.1 +++ /dev/null @@ -1 +0,0 @@ -Subproject commit ab9de95c9cf952332ab79453b4b5d1bfca8e514f diff --git a/local/reference/seL4 b/local/reference/seL4 deleted file mode 160000 index a0b4f2d25d..0000000000 --- a/local/reference/seL4 +++ /dev/null @@ -1 +0,0 @@ -Subproject commit a0b4f2d25dc975f6a9198c081359c0e38e5614fb diff --git a/local/sources/libredox b/local/sources/libredox new file mode 160000 index 0000000000..d01da350c1 --- /dev/null +++ b/local/sources/libredox @@ -0,0 +1 @@ +Subproject commit d01da350c18c2ab0709923dac602b2264a6b4530