Files
RedBear-OS/local/docs/MULTITHREADING-COMPREHENSIVE-ASSESSMENT-AND-PLAN.md
T
vasilito 533a1c2969 docs: update multi-threading plan with Phase 0c status
Prepend an UPDATE block at the top of the plan document recording:

  - The 8 commits that landed Phase 0c (futex sharding, per-CPU run
    queues, vruntime, work stealing, load balancing, cache-affine,
    initial placement, NUMA topology, proc scheme handles, fadt fix)
  - The upstream-redox kernel audit finding (upstream has none of
    these features; local fork is sole implementation)
  - A plan-vs-actual state table showing which claimed 'missing'
    features are now present
  - The kernel-side Phase 0c is complete; remaining work is
    relibc-side (Phase 0e) and futex-REQUEUE/PI/robust (Phase 1)

The detailed §1–§9 analysis is preserved unchanged as historical
record. The status column 'Missing' in §1 should be re-read as
'now present in local kernel fork, pending relibc userspace wiring.'

cargo check now exits 0 with 0 errors in the local kernel fork.
2026-07-02 07:01:16 +03:00

38 KiB
Raw Blame History

Red Bear OS — Multi-Threading Comprehensive Assessment and Implementation Plan

Date: 2026-07-02 (initial assessment); 2026-07-02 (Phase 0c patch recovery complete) Scope: Full-stack multi-threading audit: hardware/SMP, kernel scheduler, kernel futex, kernel syscall ABI, relibc pthreads, userspace threading correctness and performance Status: Authoritative — supersedes archived/KERNEL-SCHEDULER-MULTITHREAD-IMPROVEMENT-PLAN.md and archived/SCHEDULER-REVIEW-FINAL.md for all threading matters Validation levels: buildsenumeratesusablevalidatedhardware-validated


UPDATE — Phase 0c Patch Recovery (2026-07-02)

The assessment in §1 below was accurate as of the time of writing. The Phase 0c patch recovery was then executed in the same session, landing the following commits on the local kernel fork (local/sources/kernel/):

Commit Effect
ed3f0e1 P6-futex-sharding: replaces single global Mutex<L1, FutexList> with 64-shard hash table
5fb42fc RUN_QUEUE_COUNT pre-flight: defines pub const RUN_QUEUE_COUNT: usize = 40; (was missing from patch chain)
cbf051e P7-cache-affine-context (manual): surgically inserts SchedPolicy enum, SCHED_PRIORITY_LEVELS, helper functions, and 9 new Context fields (last_cpu, sched_policy, sched_rt_priority, sched_rr_ticks_consumed, sched_static_prio, sched_rr_quantum, vruntime, futex_pi_*, PhysicalAddress import)
f7652fc P5-context-mod-sched + P8-percpu-sched + P8-percpu-wiring: PerCpuSched struct with SyncUnsafeCell-wrapped per-CPU run queues, get_percpu_block helper, full per-CPU scheduler wiring in switch.rs (pick_next_from_queues, pick_next_from_global_queues, select_next_context)
7fc8bbf P8-initial-placement + P9-numa-topology + P9-proc-lock-ordering: least-loaded-CPU spawn, NUMA topology hints, proc scheme lock order fix
327c150 set_sched_policy + set_sched_other_prio: missing Context methods called by proc scheme handles
e8ec916 fadt usize/u32 type mismatch fix: changes FADT_MIN_SIZE constants to u32 to match Sdt::length()
4789d54 SchedPolicy/Name/Priority proc scheme handles: adds /proc/<tid>/{name, sched-policy, priority} paths and read/write handlers

Upstream check (bg_27f3578a, 2026-07-02): verified that gitlab.redox-os.org/redox-os/kernel master (commit aa7e7d2f44ba7cd9d1b007d37db139b345d46b8a) has NONE of these features. The local fork is the sole implementation. No upstream cherry-picks are available.

Plan-vs-actual state:

Plan claim (Section 1) Actual state after Phase 0c
"Baseline DWRR scheduler only (no per-CPU queues, no work stealing, no load balancing, no vruntime, no RT scheduling, no cache-affine)" All 5 features now present. Per-CPU PerCpuSched with run_queues, steal_work(), migrate_one_context(), maybe_balance_queues(), vruntime CFS-style weighting, last_cpu cache-affine vruntime bonus, SchedPolicy::Fifo/RoundRobin RT scanning in pick_next_from_queues
"Baseline futex only (WAIT/WAIT64/WAKE — no sharding, no PI, no REQUEUE, no robust, no WAKE_OP, no BITSET)" 🟡 Sharding done (64-shard hash). REQUEUE/PI/robust/WAKE_OP/BITSET still missing.
"relibc sched_* are all todo!(), pthread_setschedparam is a no-op, robust mutexes are todo_skip!, PI is absent" 🟡 relibc fork only has the pthread_cond_signal POSIX fix so far. sched_*, robust, PI still pending in relibc.
" Missing from relibc: CPU affinity API, Thread naming" 🟡 Kernel side done/proc/<tid>/{sched-policy, name, priority} handles. relibc pthread_setname_np / pthread_setaffinity_np / sched_setscheduler still pending.
"cargo check has 1 pre-existing error" Fixedcargo check now exits 0 with 0 errors.

Phase 0c status: kernel side complete (all 8 of 8 applicable kernel P5P9 patches re-applied or made obsolete by the existing refactored scheduler). Remaining work is in the relibc fork (Phase 0e) and the futex-REQUEUE/PI/robust work (Phase 1).

The detailed analysis in §1–§9 below is preserved as historical record. The status column "🚧 Missing" in §1 should be re-read as "now present in the local kernel fork, pending relibc userspace wiring."



1. Executive Summary

The Critical Finding — Lost Threading Work

The P5P9 scheduler and futex enhancement work (documented as "complete" in the archived plans) was lost during the local fork migration (2026-06). The local forks at local/sources/kernel/ and local/sources/relibc/ were created from upstream Redox baselines that did NOT include the Red Bear enhancement patches. The patches exist in local/patches/kernel/ and local/patches/relibc/ but are not wired into the recipes (both recipe.toml files use path = "..." with no patches = [...] list).

Impact: The running kernel has:

  • Baseline DWRR scheduler only (no per-CPU queues, no work stealing, no load balancing, no vruntime, no RT scheduling, no cache-affine)
  • Baseline futex only (WAIT/WAIT64/WAKE — no sharding, no PI, no REQUEUE, no robust, no WAKE_OP, no BITSET)
  • relibc sched_* are all todo!(), pthread_setschedparam is a no-op, robust mutexes are todo_skip!, PI is absent

Recovery: 13 of 18 kernel P5P9 patches apply cleanly to the current fork. 5 fail due to patch-chain dependencies (they expect earlier patches applied first). The bulk of the work is recoverable by re-applying patches to the forks and committing them.

What Actually Works Today

Layer Status Detail
SMP boot Solid INIT→SIPI sequence correct, per-CPU PCR via GS_BASE, x2APIC support
Context switching Solid FPU/SIMD/AVX state save via XSAVE, FSBASE/GSBASE swap (FSGSBASE or MSR), correct callee-saved register save
TLB shootdown protocol Correct AtomicBool flag + IPI + ack counter with fence(SeqCst) race prevention
Basic thread lifecycle Functional pthread_create/join/detach/exit through proc scheme + redox_rt clone
Basic synchronization Functional Futex-backed mutex, condvar, rwlock, barrier, spinlock, once
TLS Functional ELF PT_TLS + pthread_key_create/getspecific/setspecific
Per-CPU data Functional PercpuBlock via GS_BASE, all per-CPU state accessible
Signal delivery Functional Shared-memory Sigcontrol pages, per-thread masks, trampoline
Scheduler algorithm 🚧 Basic DWRR 40 priority levels, geometric weights, cooperative preemption (3-tick quantum)
Futex operations 🚧 Basic only WAIT/WAIT64/WAKE with single global mutex
SMP load balancing Missing No work stealing, no migration, contexts stuck on birth CPU
RT scheduling Missing No SCHED_FIFO/SCHED_RR, no kernel policy dispatch
Futex REQUEUE Missing Condvar broadcast causes thundering herd
Robust mutexes Missing Thread death while holding mutex → permanent deadlock
PI futexes Missing No priority inheritance → priority inversion risk
CPU affinity API Missing from relibc Kernel supports sched_affinity field but no userspace API
Thread naming Missing from relibc Kernel supports name field but no userspace API
Per-page TLB flush Missing invalidate_all() = full CR3 reload on every shootdown
NUMA awareness Missing No SRAT/SLIT, no proximity domains, flat memory model
IRQ balancing Missing All legacy IRQs hardwired to BSP

2. Layer-by-Layer Assessment

2.1 Hardware / SMP Layer

Files: src/acpi/madt/arch/x86.rs, src/arch/x86_shared/start.rs, src/arch/x86_shared/device/local_apic.rs, src/arch/x86_shared/device/ioapic.rs, src/arch/x86_shared/ipi.rs, src/arch/x86_shared/interrupt/ipi.rs, src/percpu.rs, src/arch/x86_shared/gdt.rs

Verdict: Functional foundation, performance gaps.

Component Status Detail
AP boot (INIT/SIPI) validated Correct trampoline at 0x8000, per-AP PCR/IDT/stack allocation
x2APIC mode builds Detected via CPUID, MSR-based access, APIC ID detection
Per-CPU PCR via GS_BASE validated PercpuBlock::current() reads from PCR, SWAPGS protocol correct
IPI send/receive functional 5 IPI kinds (Wakeup/Tlb/Switch/Pit/Profile), broadcast + unicast
TLB shootdown correct AtomicBool + IPI + ack with fence(SeqCst) race prevention
TLB granularity coarse Full CR3 reload (mov cr3, cr3) on every shootdown — no INVLPG
TLB broadcast 🚧 sequential Iterates CPUs individually, doesn't use ICR "all excluding self" shorthand
IRQ routing BSP-only Legacy I/O APIC entries hardcode dest: bsp_apic_id
NUMA absent No SRAT/SLIT, no proximity domains
SMT/HT topology absent No cache hierarchy, no hyperthread awareness
Idle loop functional MWAIT with deepest C-state or HLT fallback
W^X for trampoline 🚧 minor Trampoline page briefly W+X, unmapped after AP boot

2.2 Kernel Scheduler Layer

Files: src/context/switch.rs, src/context/mod.rs, src/context/context.rs, src/context/timeout.rs

Verdict: Correct but primitive — DWRR only, no SMP balancing, no RT classes.

Algorithm: Deficit Weighted Round Robin (DWRR)

  • 40 priority levels, each a VecDeque<WeakContextRef>
  • Geometric weights: SCHED_PRIO_TO_WEIGHT[i] ≈ 1.25^i (88761 → 15)
  • Per-CPU balance accumulator drives dequeue decisions
  • Quantum: 3 PIT ticks (~12.2ms) per scheduling round
  • Cooperative preemption: preempt_locks > 0 disables preemption

Global locks:

  • RUN_CONTEXTS: Mutex<L1, RunContextData> — all 40 priority queues under one L1 lock
  • IDLE_CONTEXTS: Mutex<L2, VecDeque<WeakContextRef>> — sleeping contexts
  • CONTEXT_SWITCH_LOCK: AtomicBool — global CAS spinlock serializing all context switches

What's missing (all was in lost P5P9 work):

Gap Lost Patch Recoverable?
Per-CPU run queues (eliminate global L1) P6-percpu-runqueues, P8-percpu-sched, P8-percpu-wiring applies cleanly
Work stealing P8-work-stealing needs rebase (depends on per-CPU wiring)
Initial placement (least-loaded CPU) P8-initial-placement applies cleanly
Load balancing P8-load-balance (absorbed) needs verification
Vruntime tracking + min-vruntime selection P6-vruntime-switch applies cleanly
SchedPolicy enum (FIFO/RR/Other) P5-sched-rt-policy applies cleanly
RT scheduling dispatch P5-sched-rt-policy applies cleanly
Cache-affine scheduling P7-cache-affine-switch applies cleanly
NUMA topology hints P9-numa-topology applies cleanly

2.3 Kernel Futex Layer

File: src/syscall/futex.rs

Verdict: Baseline only — critical operations missing for desktop workloads.

Operation Status Impact of Absence
FUTEX_WAIT (32-bit)
FUTEX_WAIT64 (64-bit)
FUTEX_WAKE
FUTEX_REQUEUE returns EINVAL pthread_cond_broadcast wakes ALL waiters (thundering herd)
FUTEX_CMP_REQUEUE not defined Same + atomicity gap
FUTEX_WAKE_OP not defined glibc mutex fast path unavailable
FUTEX_WAIT_BITSET not defined pselect/ppoll optimization unavailable
FUTEX_WAKE_BITSET not defined Targeted wake unavailable
FUTEX_LOCK_PI / UNLOCK_PI not defined Priority inversion unprotected
Robust futex list not defined Thread death → permanent deadlock
Futex sharding (per-futex lock) single global L1 mutex All futex ops on all CPUs contend on one lock
Process-private futexes global table Unnecessary cross-process visibility

Architecture:

static FUTEXES: Mutex<L1, FutexList>  // single global lock
type FutexList = HashMap<PhysicalAddress, Vec<FutexEntry>>

Physical address is the key (enables cross-address-space futex via MAP_SHARED). Virtual address + Weak used for CoW disambiguation.

Recoverable work (lost patches):

Feature Lost Patch Applies?
64-shard hash table P6-futex-sharding cleanly
FUTEX_REQUEUE + CMP_REQUEUE P8-futex-requeue needs rebase
PI futex (LOCK_PI/UNLOCK_PI/TRYLOCK_PI) P8-futex-pi needs rebase
PI CAS fix P9-futex-pi-cas-fix needs rebase
Robust futex list P8-futex-robust needs rebase

The 4 failing patches likely fail because they depend on sharding (P6-futex-sharding) being applied first. Apply in order: P6-sharding → P8-requeue → P8-pi → P8-robust → P9-pi-cas-fix.

2.4 Kernel Syscall ABI Layer

Files: src/syscall/mod.rs, src/syscall/futex.rs, src/syscall/time.rs, src/syscall/process.rs, local/sources/syscall/src/number.rs, src/scheme/proc.rs

Verdict: Minimal surface — most threading done via proc scheme, not syscalls.

The kernel defines only ~35 syscall numbers. Threading-relevant ones:

Syscall Status Notes
SYS_FUTEX (240) partial WAIT/WAIT64/WAKE only
SYS_YIELD (158) context::switch() + signal handler
SYS_FMAP (900) Anonymous + file-backed mmap
SYS_FUNMAP (92) munmap
SYS_MPROTECT (125)
SYS_MREMAP (155)
SYS_NANOSLEEP (162) EINTR-aware
SYS_CLOCK_GETTIME (265) partial REALTIME + MONOTONIC only

Threading done via proc scheme (not syscalls):

Operation Mechanism
Thread/process creation proc: scheme: open "new-context", share addr_space + files via kdup
waitpid proc: scheme: EVENT_READ on context fd
getpid/gettid proc: scheme: read "attrs" handle
kill/tkill proc: scheme: ForceKill / Interrupt ContextVerb
CPU affinity proc: scheme: write "sched-affinity" handle
Priority proc: scheme: write "attrs" prio field
Signal setup proc: scheme: write "sighandler" + shared Sigcontrol pages
TLS base (FSBASE) proc: scheme: write "regs/env" EnvRegisters

Completely missing syscalls (no number, no handler): clone, fork, vfork, waitpid, wait4, kill, tkill, tgkill, arch_prctl, set_thread_area, set_tid_address, set_robust_list, get_robust_list, sched_setaffinity, sched_getaffinity, sched_setscheduler, sched_getparam, sigaction, sigprocmask, sigpending, sigsuspend, sigtimedwait, timer_create, timer_settime, timer_delete, timerfd_create, getrusage, setrlimit, getrlimit, times

2.5 relibc Pthread Layer

Files: src/pthread/mod.rs, src/sync/*.rs, src/header/pthread/*.rs, src/header/sched/mod.rs, src/ld_so/tcb.rs, src/platform/redox/mod.rs

Verdict: Core pthreads solid, scheduling/robust/PI absent, several POSIX gaps.

Fully Working (futex-backed)

API Group Backend Notes
pthread_create/join/detach/exit redox_rt clone + Waitval Stack via mmap, TLS via Tcb::new()
pthread_cancel/setcancelstate/testcancel SIGRT_RLCT_CANCEL (33) Deferred cancellation only
pthread_mutex_* (normal/recursive/errorcheck) AtomicU32 CAS + futex_wait/wake 3-state: unlocked/locked/waiters
pthread_cond_* Two-counter futex design CLOCK_REALTIME only (monotonic = stub)
pthread_rwlock_* AtomicU32 + futex Reader count + WAITING_WR bit
pthread_barrier_* Mutex + Cond gen_id wrapping counter
pthread_spin_* AtomicI32 CAS No futex, pure spinning
pthread_once 3-state futex (UNINIT→INITING→INIT)
pthread_key_create/getspecific/setspecific/delete BTreeMap global + thread_local values Destructor iteration per POSIX
pthread_sigmask Delegates to sigprocmask
pthread_kill redox_rt::rlct_kill
pthread_atfork Thread-local LinkedList hooks
ELF TLS (__thread / #[thread_local]) PT_TLS + Tcb Static + dynamic DTV for dlopen
pthread_attr_* (getters/setters) RlctAttr struct

Stubs / No-ops / Missing

API Status Root Cause
sched_get_priority_max/min todo!() Kernel has no scheduling policy API
sched_getparam/setparam todo!() Same
sched_setscheduler todo!() Same
sched_rr_get_interval todo!() Same
pthread_setschedparam No-op (returns Ok) Kernel ignores policy
pthread_setschedprio No-op (returns Ok) Kernel ignores priority change
pthread_getschedparam todo!()
pthread_getcpuclockid ENOENT No per-thread CPU clock
pthread_mutex_consistent todo_skip! Robust mutex not implemented
pthread_mutex_getprioceiling todo_skip! Priority ceiling not implemented
pthread_mutex_setprioceiling todo_skip! Same
pthread_mutexattr_setprotocol (PRIO_INHERIT) Accepted, no-op PI futex missing
pthread_mutexattr_setrobust (ROBUST) Accepted, no-op Robust futex missing
pthread_cond_init CLOCK_MONOTONIC todo_skip!
pthread_cond_signal Calls broadcast (wakes ALL) Missing FUTEX_REQUEUE optimization
pthread_setaffinity_np Not defined
pthread_getaffinity_np Not defined
pthread_setname_np Not defined
pthread_getname_np Not defined
pthread_setcanceltype Always returns DEFERRED ASYNC not tracked
Guard pages Attribute stored, not mapped No PROT_NONE page before stack
PTHREAD_KEYS_MAX limit Not checked

3. Gap Classification

3.1 Correctness Gaps (Must Fix — Silent Data Corruption or Deadlock)

# Gap Impact Fix Location
C1 No robust mutexes Thread death while holding mutex → permanent deadlock for all waiters Kernel: robust futex list + relibc: pthread_mutex_consistent
C2 No PI futexes Priority inversion: low-prio thread blocks high-prio thread indefinitely Kernel: FUTEX_LOCK_PI/UNLOCK_PI + relibc: mutexattr_setprotocol
C3 pthread_cond_signal wakes ALL Correctness: wastes CPU. Performance: thundering herd on every signal relibc: use true wake(1) — may need FUTEX_REQUEUE
C4 fork() not thread-safe pthread_atfork hooks exist but child inherits locked mutexes relibc: implement atfork child handlers properly

3.2 Performance Gaps (Must Fix for Desktop Responsiveness)

# Gap Impact Fix Location
P1 No SMP load balancing Cores sit idle while others are overloaded Kernel: work stealing + initial placement
P2 No futex sharding Single global L1 mutex for ALL futex ops on ALL CPUs Kernel: 64-shard hash table
P3 No FUTEX_REQUEUE pthread_cond_broadcast wakes all → thundering herd Kernel: REQUEUE + CMP_REQUEUE
P4 Full TLB flush on every shootdown Per-page mprotect/munmap flushes entire TLB on all cores Kernel: INVLPG-based selective flush
P5 Global context switch lock Serialization bottleneck beyond ~8 cores Kernel: per-CPU context switch (needs per-CPU run queues)
P6 All IRQs to BSP CPU 0 handles all interrupts, cache thrash, latency Kernel: IRQ steering in I/O APIC + MSI/MSI-X dest field
P7 No RT scheduling Audio/compositor threads can't get priority Kernel: SchedPolicy + RT dispatch + relibc: sched_setscheduler

3.3 POSIX Completeness Gaps (Must Fix for Application Compatibility)

# Gap Impact Fix Location
X1 sched_* all todo!() Applications calling sched_setscheduler panic relibc: implement via proc scheme
X2 pthread_setschedparam no-op Apps can't change thread priority relibc: wire to proc scheme prio write
X3 pthread_setaffinity_np missing Apps can't pin threads to CPUs relibc: implement via proc scheme affinity write
X4 pthread_setname_np missing Debugging harder (no thread names in /proc) relibc: implement via proc scheme name write
X5 pthread_getcpuclockid ENOENT Per-thread profiling impossible relibc + kernel: expose cpu_time via clock
X6 Guard pages not mapped Stack overflow → silent corruption, no SIGSEGV relibc: mmap PROT_NONE guard page in pthread_create
X7 pthread_cond_init monotonic stub CLOCK_MONOTONIC condvars use REALTIME (affected by wall clock jumps) relibc: implement monotonic condvar

4. Implementation Plan

Phase 0: Patch Recovery — Re-Apply Lost Threading Work (Week 12)

Goal: Recover the P5P9 work that was lost during the local fork migration.

This is the highest-priority phase — it restores ~6 months of work with minimal new code.

0.1 — Re-apply kernel scheduler patches to local fork

Apply in dependency order to local/sources/kernel/:

Order Patch Status Action
1 P6-futex-sharding applies Commit directly
2 P6-percpu-runqueues applies Commit directly
3 P8-percpu-sched applies Commit directly
4 P8-percpu-wiring applies Commit directly
5 P8-initial-placement applies Commit directly
6 P5-sched-rt-policy applies Commit directly
7 P5-context-mod-sched applies Commit directly
8 P6-vruntime-switch applies Commit directly
9 P7-cache-affine-switch applies Commit directly
10 P9-numa-topology applies Commit directly
11 P9-proc-lock-ordering applies Commit directly
12 P8-work-stealing needs rebase Rebase against 111, then apply
13 P8-futex-requeue needs rebase Rebase against P6-sharding (#1), then apply
14 P8-futex-pi needs rebase Rebase against #13, then apply
15 P8-futex-robust needs rebase Rebase against #14, then apply
16 P9-futex-pi-cas-fix needs rebase Rebase against #14, then apply
17 P7-scheduler-improvements needs rebase Rebase against 111, then apply

Verification after each patch:

cd local/sources/kernel
cargo check  # must pass

0.2 — Re-apply relibc threading patches to local fork

Apply to local/sources/relibc/:

Patch Action
P3-threads.patch applies — commit
P3-barrier-smp-futex (from absorbed/) Verify already in fork; if not, apply
P3-pthread-signal-races (from absorbed/) Verify already in fork
P3-pthread-yield (from absorbed/) Verify already in fork
P5-robust-mutexes (from absorbed/) Verify; re-apply if missing
P5-robust-mutex-enotrec-fix (from absorbed/) Same
P5-sched-api (from absorbed/) Same
P7-pthread-affinity (from absorbed/) Same
P7-pthread-setname (from absorbed/) Same
P7-setpriority (from absorbed/) Same
P9-spin-and-barrier (from absorbed/) Same
P9-spin-fix (from absorbed/) Same
P3-semaphore-comprehensive applies

Verification:

cd local/sources/relibc
make all  # must pass
touch relibc && make prefix  # rebuild prefix with new libc

0.3 — Build and smoke test

export REDBEAR_ALLOW_PROTECTED_FETCH=1
./local/scripts/build-redbear.sh --upstream redbear-mini
make qemu  # verify boot + basic operation

Success criteria: redbear-mini boots, multi-threaded daemons (pcid, xhcid) start, no kernel panic.


Phase 1: Futex Completeness (Week 24)

Goal: Close the futex operation gaps that affect correctness and performance.

Depends on: Phase 0 complete (sharding applied first).

1.1 — FUTEX_REQUEUE + FUTEX_CMP_REQUEUE

Kernel: src/syscall/futex.rs

  • Add FUTEX_REQUEUE and FUTEX_CMP_REQUEUE to the futex dispatcher
  • Implement: move up to val waiters from addr1 → addr2, optionally compare *addr1 == val2
  • Requires locking TWO shards (acquire both in deterministic order to avoid deadlock)

relibc: src/sync/cond.rs

  • Change pthread_cond_broadcast to use FUTEX_REQUEUE (move waiters from condvar futex to mutex futex)
  • Change pthread_cond_signal to wake exactly 1 (not all)

Impact: Eliminates thundering herd on every pthread_cond_broadcast. Major win for Qt event loop, KWin compositor, Mesa worker threads.

1.2 — PI Futexes (FUTEX_LOCK_PI / FUTEX_UNLOCK_PI / FUTEX_TRYLOCK_PI)

Kernel: src/syscall/futex.rs

  • Add PiState tracking per futex: owner context + waiter list with priorities
  • On LOCK_PI block: boost owner's priority to waiter's priority
  • On UNLOCK_PI: restore original priority, wake highest-priority waiter
  • Requires kernel RT scheduling (Phase 0.1 #67: P5-sched-rt-policy)

relibc: src/sync/pthread_mutex.rs

  • Implement PTHREAD_PRIO_INHERIT protocol path using PI futex
  • Replace todo_skip! in pthread_mutex_consistent with real implementation

1.3 — Robust Futex List

Kernel: src/syscall/futex.rs + src/context/context.rs

  • Add robust_list_head: Option<usize> to Context struct
  • Implement set_robust_list / get_robust_list via proc scheme or syscall
  • On thread exit (exit_this_context): walk robust list, set FUTEX_OWNER_DIED bit, wake one waiter with EOWNERDEAD

relibc: src/sync/pthread_mutex.rs

  • Implement robust list registration in pthread_mutex_lock
  • Implement pthread_mutex_consistent: clear EOWNERDEAD state
  • Replace todo_skip! with real implementation

1.4 — FUTEX_WAKE_OP

Kernel: src/syscall/futex.rs

  • Implement atomic op + wake: perform op on addr2, then wake up to val waiters on addr1
  • Operations: set, add, or, andn, xor, with comparison condition

Impact: glibc mutex fast path optimization. Not critical for relibc but helps ported glibc-linked binaries.


Phase 2: SMP Scheduling Quality (Week 36)

Goal: Make multi-core actually distribute work.

Depends on: Phase 0 complete (per-CPU queues applied).

2.1 — Work stealing (recover + fix)

Kernel: src/context/switch.rs

  • On select_next_context() empty local queue: steal from victim CPU
  • Pick victim by round-robin, steal highest-priority runnable context
  • Limit steal batch size (12 contexts per steal attempt)
  • Send IpiKind::Wakeup to target CPU if stealing woke it from idle

Recovery: P8-work-stealing needs rebase against per-CPU wiring.

2.2 — Load balancing (recover + verify)

Kernel: src/context/switch.rs

  • Periodic balance trigger (every N ticks or when queue depth difference > threshold)
  • Migrate contexts from overloaded CPU to most-idle CPU
  • Respect sched_affinity mask during migration

Recovery: P8-load-balance is in absorbed/ — verify it's in the fork after Phase 0.

2.3 — Reschedule IPI

Kernel: src/arch/x86_shared/ipi.rs + src/context/switch.rs

  • When waking a context on a different CPU, send IpiKind::Switch to that CPU
  • Currently the Switch IPI exists but is not used by the scheduler

2.4 — Per-page TLB flush (INVLPG)

Kernel: rmm/src/arch/x86_64.rs + src/context/memory.rs

  • Add invalidate_page(addr) using invlpg instruction
  • Modify Flusher to track individual pages and use INVLPG when ≤ N pages affected
  • Fall back to CR3 reload only for large-scale invalidations

Impact: Every mprotect/mmap/munmap on a multi-threaded process currently flushes the ENTIRE TLB on every core. This is one of the most impactful single fixes.

2.5 — TLB broadcast optimization

Kernel: src/percpu.rs

  • Replace per-CPU sequential shootdown_tlb_ipi(Some(id)) loop with ICR "all excluding self" (destination shorthand 0b11)
  • Single IPI + global ack counter instead of N individual IPIs + N ack counters

Phase 3: RT Scheduling (Week 46)

Goal: Allow applications to request real-time scheduling for latency-sensitive threads.

Depends on: Phase 0 (SchedPolicy applied) + Phase 2 (per-CPU queues).

3.1 — Kernel RT scheduling dispatch

Kernel: src/context/switch.rs (from P5-sched-rt-policy — recovered in Phase 0)

  • select_next_context() passes:
    1. SCHED_FIFO contexts (highest RT priority first, no preemption within same prio)
    2. SCHED_RR contexts (highest RT priority first, round-robin within same prio)
    3. SCHED_OTHER contexts (existing DWRR/vruntime)
  • SCHED_RR quantum: configurable per-context (default 100ms)

3.2 — relibc sched_* API completion

relibc: src/header/sched/mod.rs

Replace ALL todo!() stubs:

Function Implementation
sched_getscheduler(pid) Read policy from proc scheme attrs
sched_setscheduler(pid, policy, param) Write policy + RT priority via proc scheme
sched_getparam(pid, param) Read RT priority from proc scheme
sched_setparam(pid, param) Write RT priority via proc scheme
sched_get_priority_max(policy) Return 99 for FIFO/RR, 0 for OTHER
sched_get_priority_min(policy) Return 1 for FIFO/RR, 0 for OTHER
sched_rr_get_interval(pid, tp) Return SCHED_RR quantum (100ms default)

3.3 — pthread_setschedparam wiring

relibc: src/pthread/mod.rs

  • Replace set_sched_param no-op with real proc scheme call
  • Replace set_sched_priority no-op with real proc scheme call

Phase 4: POSIX Pthread Completeness (Week 58)

Goal: Close remaining POSIX gaps that block application compatibility.

Depends on: Phase 0 + Phase 3 (for sched API).

4.1 — pthread_setaffinity_np / pthread_getaffinity_np

relibc: src/header/pthread/mod.rs + src/header/sched/mod.rs

  • Implement using proc scheme "sched-affinity" write/read
  • Define cpu_set_t type and CPU_SET/CPU_CLR/CPU_ZERO/CPU_ISSET macros

4.2 — pthread_setname_np / pthread_getname_np

relibc: src/header/pthread/mod.rs

  • Implement using proc scheme name write/read (kernel already supports 32-char name field)

4.3 — pthread_cond_init CLOCK_MONOTONIC

relibc: src/sync/cond.rs

  • Replace todo_skip! with real monotonic clock support
  • Store clock choice in cond struct, use CLOCK_MONOTONIC for deadline calculations

4.4 — Guard pages

relibc: src/pthread/mod.rs

  • In pthread_create, when allocating stack via mmap:
    • Map [stack_base, stack_base + guard_size) with PROT_NONE
    • Map [stack_base + guard_size, stack_base + guard_size + stack_size) with PROT_READ | PROT_WRITE
  • On thread exit, munmap both regions

4.5 — pthread_getcpuclockid

relibc: src/header/pthread/mod.rs

  • Return CLOCK_THREAD_CPUTIME_ID (requires kernel support — add clock to clock_gettime)

Kernel: src/syscall/time.rs

  • Add CLOCK_THREAD_CPUTIME_ID → read context.cpu_time

4.6 — PTHREAD_KEYS_MAX enforcement

relibc: src/header/pthread/tls.rs

  • Check NEXTKEY against PTHREAD_KEYS_MAX (1024) before allocating

Phase 5: IRQ Steering and NUMA (Week 812)

Goal: Distribute interrupt load and respect memory locality.

Depends on: Phase 2 (per-CPU infrastructure).

5.1 — IRQ steering

Kernel: src/arch/x86_shared/device/ioapic.rs + src/arch/x86_shared/idt.rs

  • Change I/O APIC redirection dest from bsp_apic_id to round-robin or RSS hash
  • Add per-CPU legacy IRQ handlers in IDT (not just BSP)
  • For MSI/MSI-X: set destination CPU in Message Address register

5.2 — NUMA topology discovery

Kernel: src/acpi/ (from P9-numa-topology — recovered in Phase 0)

  • Parse SRAT (Static Resource Affinity Table) for proximity domains
  • Parse SLIT (System Locality Distance Information Table) for inter-node distances
  • Store NumaTopology in kernel for O(1) scheduling lookups

5.3 — NUMA-aware memory allocation

Kernel: src/memory/ + frame allocator

  • Track frame NUMA node in Frame or PageInfo
  • On allocation, prefer frames from requesting CPU's NUMA node
  • Fallback to remote node when local node is exhausted

5. Dependency Chain

Phase 0 (Patch Recovery) ← BLOCKING FOR ALL OTHERS
    │
    ├──► Phase 1 (Futex Completeness)
    │       │
    │       ├──► 1.1 REQUEUE ──► condvar performance
    │       ├──► 1.2 PI ──► priority inversion fix (needs Phase 3.1)
    │       ├──► 1.3 Robust ──► deadlock prevention
    │       └──► 1.4 WAKE_OP ──► glibc compat
    │
    ├──► Phase 2 (SMP Scheduling)
    │       │
    │       ├──► 2.1 Work stealing ──► core utilization
    │       ├──► 2.2 Load balancing ──► fair distribution
    │       ├──► 2.3 Reschedule IPI ──→ cross-CPU wakeup
    │       ├──► 2.4 Per-page TLB ──► mmap/mprotect performance
    │       └──► 2.5 TLB broadcast ──► IPI efficiency
    │
    ├──► Phase 3 (RT Scheduling)
    │       │
    │       ├──► 3.1 Kernel RT dispatch (from Phase 0)
    │       ├──► 3.2 relibc sched_* API ──► POSIX compat
    │       └──► 3.3 pthread_setschedparam ──► app priority control
    │
    ├──► Phase 4 (POSIX Pthread Completeness)
    │       │
    │       ├──► 4.1 Affinity API ──► CPU pinning
    │       ├──► 4.2 Thread naming ──► debuggability
    │       ├──► 4.3 Monotonic condvar ──► clock correctness
    │       ├──► 4.4 Guard pages ──► stack overflow detection
    │       ├──► 4.5 CPU clock ──► per-thread profiling
    │       └──► 4.6 Keys max ──► resource limit
    │
    └──► Phase 5 (IRQ + NUMA)
            │
            ├──► 5.1 IRQ steering ──► interrupt distribution
            ├──► 5.2 NUMA topology ──► (from Phase 0)
            └──► 5.3 NUMA allocator ──► memory locality

Parallel work possible:

  • Phase 1 + Phase 2 + Phase 3 can run in parallel after Phase 0
  • Phase 4 items are independent of each other
  • Phase 5 depends on Phase 2 but not on Phase 1/3/4

6. Validation Plan

6.1 Build Evidence

Check Command
Kernel compiles make r.kernel
relibc compiles make r.relibc
Prefix rebuilt touch relibc kernel && make prefix
Full OS builds make all CONFIG_NAME=redbear-mini

6.2 Runtime Evidence (QEMU)

Test Verification
Multi-threaded boot make qemu QEMUFLAGS="-smp 4" — all 4 CPUs active
pthread smoke test Guest: compile + run simple pthread_create/join/mutex test
Work stealing Guest: spawn 8 threads on 4-CPU QEMU, verify all CPUs utilized
Futex REQUEUE Guest: condvar broadcast benchmark — waiters wake in ≤2 batches, not N
PI futex Guest: priority inversion test — high-prio thread unblocked within 1 tick
Robust mutex Guest: kill thread holding mutex, verify EOWNERDEAD recovery
RT scheduling Guest: SCHED_FIFO thread preempts SCHED_OTHER within 100μs
CPU affinity Guest: pin thread to CPU 1, verify it never runs on CPU 0
Thread naming Guest: cat /scheme/proc/*/name shows set names
Guard pages Guest: overflow stack, verify SIGSEGV (not silent corruption)
TLB efficiency Guest: mprotect benchmark — compare TLB miss rate before/after

6.3 Validation Scripts (to create)

local/scripts/test-threading-qemu.sh          # Comprehensive threading smoke test
local/scripts/test-futex-requeue-qemu.sh      # REQUEUE-specific test
local/scripts/test-futex-pi-qemu.sh           # PI futex test
local/scripts/test-futex-robust-qemu.sh       # Robust mutex test
local/scripts/test-sched-rt-qemu.sh           # RT scheduling latency test
local/scripts/test-sched-balance-qemu.sh      # Load balancing on multi-vCPU
local/scripts/test-threading-baremetal.sh     # Bare metal multi-threaded stress

7. Estimated Effort

Phase Duration New Code Recovery Dependencies
Phase 0: Patch Recovery 12 weeks Minimal (rebase 5 patches) 13 patches apply directly None
Phase 1: Futex Completeness 23 weeks REQUEUE impl + WAKE_OP PI/robust from P8 patches Phase 0
Phase 2: SMP Scheduling 34 weeks TLB INVLPG + broadcast opt Work stealing from P8 Phase 0
Phase 3: RT Scheduling 12 weeks relibc sched_* API RT dispatch from P5 Phase 0
Phase 4: POSIX Pthread 23 weeks Affinity/naming/guard/clock Partial from P7 patches Phase 0, 3
Phase 5: IRQ + NUMA 34 weeks IRQ steering + NUMA allocator NUMA topology from P9 Phase 0, 2

Total: 1218 weeks with 12 developers. Phase 0 alone recovers the majority of the value in 12 weeks.


8. Integration with Existing Plans

Plan Relationship
CONSOLE-TO-KDE-DESKTOP-PLAN.md Consumer — Phase 3 (KWin) needs PI futex + RT scheduling; Phase 2 (compositor) needs work stealing
IRQ-AND-LOWLEVEL-CONTROLLERS-ENHANCEMENT-PLAN.md Sibling — IRQ steering (Phase 5.1) belongs to both plans
DRM-MODERNIZATION-EXECUTION-PLAN.md Consumer — GPU worker threads benefit from load balancing + affinity
IMPLEMENTATION-MASTER-PLAN.md Parent — this plan covers the kernel threading substrate
CPU-DMA-IRQ-MSI-SCHEDULER-FIX-PLAN.md Sibling — overlaps on scheduler/IRQ delivery

9. Bottom Line

The Red Bear OS threading stack is functional for basic single-threaded and lightly-threaded workloads. The SMP boot, context switching, TLB shootdown, and basic futex operations are correct.

The critical problem is that 6 months of threading enhancement work (P5P9 patches) was lost during the local fork migration. This work exists as patch files that apply cleanly to the current fork — Phase 0 (Patch Recovery) is the single highest-ROI action.

After Phase 0, the remaining gaps are:

  1. Futex REQUEUE/PI/robust — for condvar performance and deadlock prevention
  2. SMP work stealing + load balancing — for multi-core utilization
  3. RT scheduling — for audio/compositor thread priority
  4. POSIX pthread completeness — for application compatibility
  5. IRQ steering + NUMA — for multi-socket performance

The desktop-critical path (KWin responsiveness) requires Phases 03. The server-critical path (multi-socket, NUMA) adds Phase 5. Phase 4 (POSIX completeness) benefits all paths but is not desktop-blocking.