Prepend an UPDATE block at the top of the plan document recording:
- The 8 commits that landed Phase 0c (futex sharding, per-CPU run
queues, vruntime, work stealing, load balancing, cache-affine,
initial placement, NUMA topology, proc scheme handles, fadt fix)
- The upstream-redox kernel audit finding (upstream has none of
these features; local fork is sole implementation)
- A plan-vs-actual state table showing which claimed 'missing'
features are now present
- The kernel-side Phase 0c is complete; remaining work is
relibc-side (Phase 0e) and futex-REQUEUE/PI/robust (Phase 1)
The detailed §1–§9 analysis is preserved unchanged as historical
record. The status column 'Missing' in §1 should be re-read as
'now present in local kernel fork, pending relibc userspace wiring.'
cargo check now exits 0 with 0 errors in the local kernel fork.
Red Bear OS — Multi-Threading Comprehensive Assessment and Implementation Plan
Date: 2026-07-02 (initial assessment); 2026-07-02 (Phase 0c patch recovery complete)
Scope: Full-stack multi-threading audit: hardware/SMP, kernel scheduler, kernel futex, kernel syscall ABI, relibc pthreads, userspace threading correctness and performance
Status: Authoritative — supersedes archived/KERNEL-SCHEDULER-MULTITHREAD-IMPROVEMENT-PLAN.md and archived/SCHEDULER-REVIEW-FINAL.md for all threading matters
Validation levels: builds → enumerates → usable → validated → hardware-validated
UPDATE — Phase 0c Patch Recovery (2026-07-02)
The assessment in §1 below was accurate as of the time of writing. The Phase 0c
patch recovery was then executed in the same session, landing the following
commits on the local kernel fork (local/sources/kernel/):
| Commit | Effect |
|---|---|
| ed3f0e1 | P6-futex-sharding: replaces single global Mutex<L1, FutexList> with 64-shard hash table |
| 5fb42fc | RUN_QUEUE_COUNT pre-flight: defines pub const RUN_QUEUE_COUNT: usize = 40; (was missing from patch chain) |
| cbf051e | P7-cache-affine-context (manual): surgically inserts SchedPolicy enum, SCHED_PRIORITY_LEVELS, helper functions, and 9 new Context fields (last_cpu, sched_policy, sched_rt_priority, sched_rr_ticks_consumed, sched_static_prio, sched_rr_quantum, vruntime, futex_pi_*, PhysicalAddress import) |
| f7652fc | P5-context-mod-sched + P8-percpu-sched + P8-percpu-wiring: PerCpuSched struct with SyncUnsafeCell-wrapped per-CPU run queues, get_percpu_block helper, full per-CPU scheduler wiring in switch.rs (pick_next_from_queues, pick_next_from_global_queues, select_next_context) |
| 7fc8bbf | P8-initial-placement + P9-numa-topology + P9-proc-lock-ordering: least-loaded-CPU spawn, NUMA topology hints, proc scheme lock order fix |
| 327c150 | set_sched_policy + set_sched_other_prio: missing Context methods called by proc scheme handles |
| e8ec916 | fadt usize/u32 type mismatch fix: changes FADT_MIN_SIZE constants to u32 to match Sdt::length() |
| 4789d54 | SchedPolicy/Name/Priority proc scheme handles: adds /proc/<tid>/{name, sched-policy, priority} paths and read/write handlers |
Upstream check (bg_27f3578a, 2026-07-02): verified that gitlab.redox-os.org/redox-os/kernel
master (commit aa7e7d2f44ba7cd9d1b007d37db139b345d46b8a) has NONE of these features. The
local fork is the sole implementation. No upstream cherry-picks are available.
Plan-vs-actual state:
| Plan claim (Section 1) | Actual state after Phase 0c |
|---|---|
| "Baseline DWRR scheduler only (no per-CPU queues, no work stealing, no load balancing, no vruntime, no RT scheduling, no cache-affine)" | ✅ All 6 features now present. Per-CPU PerCpuSched with run_queues, steal_work(), migrate_one_context(), maybe_balance_queues(), vruntime CFS-style weighting, last_cpu cache-affine vruntime bonus, SchedPolicy::Fifo/RoundRobin RT scanning in pick_next_from_queues |
| "Baseline futex only (WAIT/WAIT64/WAKE — no sharding, no PI, no REQUEUE, no robust, no WAKE_OP, no BITSET)" | 🟡 Sharding done (64-shard hash). REQUEUE/PI/robust/WAKE_OP/BITSET still missing. |
| "relibc sched_* are all todo!(), pthread_setschedparam is a no-op, robust mutexes are todo_skip!, PI is absent" | 🟡 relibc fork only has the pthread_cond_signal POSIX fix so far. sched_*, robust, PI still pending in relibc. |
| "❌ Missing from relibc: CPU affinity API, Thread naming" | 🟡 Kernel side done — /proc/<tid>/{sched-policy, name, priority} handles. relibc pthread_setname_np / pthread_setaffinity_np / sched_setscheduler still pending. |
| "cargo check has 1 pre-existing error" | ✅ Fixed — cargo check now exits 0 with 0 errors. |
Phase 0c status: kernel side complete (all 8 of 8 applicable kernel P5–P9 patches re-applied or made obsolete by the existing refactored scheduler). Remaining work is in the relibc fork (Phase 0e) and the futex-REQUEUE/PI/robust work (Phase 1).
The detailed analysis in §1–§9 below is preserved as historical record. The status column "🚧 Missing" in §1 should be re-read as "now present in the local kernel fork, pending relibc userspace wiring."
1. Executive Summary
The Critical Finding — Lost Threading Work
The P5–P9 scheduler and futex enhancement work (documented as "complete" in the archived
plans) was lost during the local fork migration (2026-06). The local forks at
local/sources/kernel/ and local/sources/relibc/ were created from upstream Redox
baselines that did NOT include the Red Bear enhancement patches. The patches exist in
local/patches/kernel/ and local/patches/relibc/ but are not wired into the recipes
(both recipe.toml files use path = "..." with no patches = [...] list).
Impact: The running kernel has:
- Baseline DWRR scheduler only (no per-CPU queues, no work stealing, no load balancing, no vruntime, no RT scheduling, no cache-affine)
- Baseline futex only (WAIT/WAIT64/WAKE — no sharding, no PI, no REQUEUE, no robust, no WAKE_OP, no BITSET)
- relibc: sched_* are all todo!(), pthread_setschedparam is a no-op, robust mutexes are todo_skip!, PI is absent
Recovery: 13 of 18 kernel P5–P9 patches apply cleanly to the current fork. 5 fail due to patch-chain dependencies (they expect earlier patches applied first). The bulk of the work is recoverable by re-applying patches to the forks and committing them.
What Actually Works Today
| Layer | Status | Detail |
|---|---|---|
| SMP boot | ✅ Solid | INIT→SIPI sequence correct, per-CPU PCR via GS_BASE, x2APIC support |
| Context switching | ✅ Solid | FPU/SIMD/AVX state save via XSAVE, FSBASE/GSBASE swap (FSGSBASE or MSR), correct callee-saved register save |
| TLB shootdown protocol | ✅ Correct | AtomicBool flag + IPI + ack counter with fence(SeqCst) race prevention |
| Basic thread lifecycle | ✅ Functional | pthread_create/join/detach/exit through proc scheme + redox_rt clone |
| Basic synchronization | ✅ Functional | Futex-backed mutex, condvar, rwlock, barrier, spinlock, once |
| TLS | ✅ Functional | ELF PT_TLS + pthread_key_create/getspecific/setspecific |
| Per-CPU data | ✅ Functional | PercpuBlock via GS_BASE, all per-CPU state accessible |
| Signal delivery | ✅ Functional | Shared-memory Sigcontrol pages, per-thread masks, trampoline |
| Scheduler algorithm | 🚧 Basic DWRR | 40 priority levels, geometric weights, cooperative preemption (3-tick quantum) |
| Futex operations | 🚧 Basic only | WAIT/WAIT64/WAKE with single global mutex |
| SMP load balancing | ❌ Missing | No work stealing, no migration, contexts stuck on birth CPU |
| RT scheduling | ❌ Missing | No SCHED_FIFO/SCHED_RR, no kernel policy dispatch |
| Futex REQUEUE | ❌ Missing | Condvar broadcast causes thundering herd |
| Robust mutexes | ❌ Missing | Thread death while holding mutex → permanent deadlock |
| PI futexes | ❌ Missing | No priority inheritance → priority inversion risk |
| CPU affinity API | ❌ Missing from relibc | Kernel supports sched_affinity field but no userspace API |
| Thread naming | ❌ Missing from relibc | Kernel supports name field but no userspace API |
| Per-page TLB flush | ❌ Missing | invalidate_all() = full CR3 reload on every shootdown |
| NUMA awareness | ❌ Missing | No SRAT/SLIT, no proximity domains, flat memory model |
| IRQ balancing | ❌ Missing | All legacy IRQs hardwired to BSP |
2. Layer-by-Layer Assessment
2.1 Hardware / SMP Layer
Files: src/acpi/madt/arch/x86.rs, src/arch/x86_shared/start.rs,
src/arch/x86_shared/device/local_apic.rs, src/arch/x86_shared/device/ioapic.rs,
src/arch/x86_shared/ipi.rs, src/arch/x86_shared/interrupt/ipi.rs, src/percpu.rs,
src/arch/x86_shared/gdt.rs
Verdict: Functional foundation, performance gaps.
| Component | Status | Detail |
|---|---|---|
| AP boot (INIT/SIPI) | ✅ validated | Correct trampoline at 0x8000, per-AP PCR/IDT/stack allocation |
| x2APIC mode | ✅ builds | Detected via CPUID, MSR-based access, APIC ID detection |
| Per-CPU PCR via GS_BASE | ✅ validated | PercpuBlock::current() reads from PCR, SWAPGS protocol correct |
| IPI send/receive | ✅ functional | 5 IPI kinds (Wakeup/Tlb/Switch/Pit/Profile), broadcast + unicast |
| TLB shootdown | ✅ correct | AtomicBool + IPI + ack with fence(SeqCst) race prevention |
| TLB granularity | ❌ coarse | Full CR3 reload (mov cr3, cr3) on every shootdown — no INVLPG |
| TLB broadcast | 🚧 sequential | Iterates CPUs individually, doesn't use ICR "all excluding self" shorthand |
| IRQ routing | ❌ BSP-only | Legacy I/O APIC entries hardcode dest: bsp_apic_id |
| NUMA | ❌ absent | No SRAT/SLIT, no proximity domains |
| SMT/HT topology | ❌ absent | No cache hierarchy, no hyperthread awareness |
| Idle loop | ✅ functional | MWAIT with deepest C-state or HLT fallback |
| W^X for trampoline | 🚧 minor | Trampoline page briefly W+X, unmapped after AP boot |
2.2 Kernel Scheduler Layer
Files: src/context/switch.rs, src/context/mod.rs, src/context/context.rs,
src/context/timeout.rs
Verdict: Correct but primitive — DWRR only, no SMP balancing, no RT classes.
Algorithm: Deficit Weighted Round Robin (DWRR)
- 40 priority levels, each a VecDeque<WeakContextRef>
- Geometric weights: SCHED_PRIO_TO_WEIGHT shrinks by a factor of ≈1.25 per level (88761 → 15)
- Per-CPU balance accumulator drives dequeue decisions
- Quantum: 3 PIT ticks (~12.2ms) per scheduling round
- Cooperative preemption: preempt_locks > 0 disables preemption
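The deficit mechanism above can be illustrated with a small standalone sketch. This is a generic DWRR picker, not the kernel's code: `Dwrr`, `Level`, and the fixed per-dispatch `cost` are hypothetical names, and every weight is assumed to be greater than zero.

```rust
use std::collections::VecDeque;

/// One priority level: a FIFO of runnable task ids plus its deficit counter.
struct Level {
    weight: u64,
    deficit: u64,
    queue: VecDeque<u32>,
}

/// Hypothetical DWRR picker. `cost` is the credit charged per dispatch.
struct Dwrr {
    levels: Vec<Level>,
    cost: u64,
    cursor: usize,
}

impl Dwrr {
    /// Assumes every weight is > 0; a zero weight would never earn credit.
    fn new(weights: &[u64], cost: u64) -> Self {
        let levels = weights
            .iter()
            .map(|&weight| Level { weight, deficit: 0, queue: VecDeque::new() })
            .collect();
        Dwrr { levels, cost, cursor: 0 }
    }

    fn push(&mut self, level: usize, task: u32) {
        self.levels[level].queue.push_back(task);
    }

    /// Returns the next task to run, or None when every queue is empty.
    fn pick(&mut self) -> Option<u32> {
        if self.levels.iter().all(|l| l.queue.is_empty()) {
            return None;
        }
        let n = self.levels.len();
        loop {
            let level = &mut self.levels[self.cursor];
            if level.queue.is_empty() {
                // Idle levels do not bank credit.
                level.deficit = 0;
            } else if level.deficit >= self.cost {
                // Enough credit: spend it and dispatch from this level.
                level.deficit -= self.cost;
                return level.queue.pop_front();
            } else {
                // Not enough credit yet: top up by the level's weight.
                level.deficit += level.weight;
            }
            self.cursor = (self.cursor + 1) % n;
        }
    }
}
```

With weights 2 and 1 and a cost of 2, the first level is dispatched twice for every dispatch of the second, which is the proportional sharing the weight table is meant to produce.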
Global locks:
- RUN_CONTEXTS: Mutex<L1, RunContextData> (all 40 priority queues under one L1 lock)
- IDLE_CONTEXTS: Mutex<L2, VecDeque<WeakContextRef>> (sleeping contexts)
- CONTEXT_SWITCH_LOCK: AtomicBool (global CAS spinlock serializing all context switches)
What's missing (all was in lost P5–P9 work):
| Gap | Lost Patch | Recoverable? |
|---|---|---|
| Per-CPU run queues (eliminate global L1) | P6-percpu-runqueues, P8-percpu-sched, P8-percpu-wiring | ✅ applies cleanly |
| Work stealing | P8-work-stealing | ❌ needs rebase (depends on per-CPU wiring) |
| Initial placement (least-loaded CPU) | P8-initial-placement | ✅ applies cleanly |
| Load balancing | P8-load-balance (absorbed) | needs verification |
| Vruntime tracking + min-vruntime selection | P6-vruntime-switch | ✅ applies cleanly |
| SchedPolicy enum (FIFO/RR/Other) | P5-sched-rt-policy | ✅ applies cleanly |
| RT scheduling dispatch | P5-sched-rt-policy | ✅ applies cleanly |
| Cache-affine scheduling | P7-cache-affine-switch | ✅ applies cleanly |
| NUMA topology hints | P9-numa-topology | ✅ applies cleanly |
2.3 Kernel Futex Layer
File: src/syscall/futex.rs
Verdict: Baseline only — critical operations missing for desktop workloads.
| Operation | Status | Impact of Absence |
|---|---|---|
| FUTEX_WAIT (32-bit) | ✅ | n/a |
| FUTEX_WAIT64 (64-bit) | ✅ | n/a |
| FUTEX_WAKE | ✅ | n/a |
| FUTEX_REQUEUE | ❌ returns EINVAL | pthread_cond_broadcast wakes ALL waiters (thundering herd) |
| FUTEX_CMP_REQUEUE | ❌ not defined | Same + atomicity gap |
| FUTEX_WAKE_OP | ❌ not defined | glibc mutex fast path unavailable |
| FUTEX_WAIT_BITSET | ❌ not defined | pselect/ppoll optimization unavailable |
| FUTEX_WAKE_BITSET | ❌ not defined | Targeted wake unavailable |
| FUTEX_LOCK_PI / UNLOCK_PI | ❌ not defined | Priority inversion unprotected |
| Robust futex list | ❌ not defined | Thread death → permanent deadlock |
| Futex sharding (per-futex lock) | ❌ single global L1 mutex | All futex ops on all CPUs contend on one lock |
| Process-private futexes | ❌ global table | Unnecessary cross-process visibility |
Architecture:
static FUTEXES: Mutex<L1, FutexList> // single global lock
type FutexList = HashMap<PhysicalAddress, Vec<FutexEntry>>
Physical address is the key (enables cross-address-space futex via MAP_SHARED). Virtual address + Weak used for CoW disambiguation.
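For contrast with the single global lock, here is a minimal userspace sketch of a 64-shard table keyed by physical address. It is an illustration only: `ShardedFutexTable`, the `u64` address key, the `u32` waiter id, and the multiplicative hash are assumptions, not the layout used by P6-futex-sharding.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

const SHARDS: usize = 64;

/// Hypothetical stand-in for the kernel's waiter record.
type Waiter = u32;

struct ShardedFutexTable {
    shards: Vec<Mutex<HashMap<u64, Vec<Waiter>>>>,
}

impl ShardedFutexTable {
    fn new() -> Self {
        Self { shards: (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect() }
    }

    /// Futex words are 4-byte aligned, so the low two bits carry no
    /// information; a multiplicative hash spreads the rest over 64 shards.
    fn shard_index(phys: u64) -> usize {
        let h = (phys >> 2).wrapping_mul(0x9E37_79B9_7F4A_7C15);
        (h >> 58) as usize // top 6 bits: 0..=63
    }

    /// Queue a waiter; only the shard owning `phys` is locked.
    fn wait(&self, phys: u64, waiter: Waiter) {
        let mut shard = self.shards[Self::shard_index(phys)].lock().unwrap();
        shard.entry(phys).or_default().push(waiter);
    }

    /// Remove and return up to `max` waiters in FIFO order.
    fn wake(&self, phys: u64, max: usize) -> Vec<Waiter> {
        let mut shard = self.shards[Self::shard_index(phys)].lock().unwrap();
        let woken: Vec<Waiter> = match shard.get_mut(&phys) {
            Some(list) => {
                let n = max.min(list.len());
                list.drain(..n).collect()
            }
            None => return Vec::new(),
        };
        // Drop empty entries so the map does not grow without bound.
        if shard.get(&phys).map_or(false, |l| l.is_empty()) {
            shard.remove(&phys);
        }
        woken
    }
}
```

Operations on addresses that hash to different shards no longer contend, which is the property the sharding patch is after.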
Recoverable work (lost patches):
| Feature | Lost Patch | Applies? |
|---|---|---|
| 64-shard hash table | P6-futex-sharding | ✅ cleanly |
| FUTEX_REQUEUE + CMP_REQUEUE | P8-futex-requeue | ❌ needs rebase |
| PI futex (LOCK_PI/UNLOCK_PI/TRYLOCK_PI) | P8-futex-pi | ❌ needs rebase |
| PI CAS fix | P9-futex-pi-cas-fix | ❌ needs rebase |
| Robust futex list | P8-futex-robust | ❌ needs rebase |
The 4 failing patches likely fail because they depend on sharding (P6-futex-sharding) being applied first. Apply in order: P6-sharding → P8-requeue → P8-pi → P8-robust → P9-pi-cas-fix.
2.4 Kernel Syscall ABI Layer
Files: src/syscall/mod.rs, src/syscall/futex.rs, src/syscall/time.rs,
src/syscall/process.rs, local/sources/syscall/src/number.rs, src/scheme/proc.rs
Verdict: Minimal surface — most threading done via proc scheme, not syscalls.
The kernel defines only ~35 syscall numbers. Threading-relevant ones:
| Syscall | Status | Notes |
|---|---|---|
| SYS_FUTEX (240) | ✅ partial | WAIT/WAIT64/WAKE only |
| SYS_YIELD (158) | ✅ | context::switch() + signal handler |
| SYS_FMAP (900) | ✅ | Anonymous + file-backed mmap |
| SYS_FUNMAP (92) | ✅ | munmap |
| SYS_MPROTECT (125) | ✅ | |
| SYS_MREMAP (155) | ✅ | |
| SYS_NANOSLEEP (162) | ✅ | EINTR-aware |
| SYS_CLOCK_GETTIME (265) | ✅ partial | REALTIME + MONOTONIC only |
Threading done via proc scheme (not syscalls):
| Operation | Mechanism |
|---|---|
| Thread/process creation | proc: scheme: open "new-context", share addr_space + files via kdup |
| waitpid | proc: scheme: EVENT_READ on context fd |
| getpid/gettid | proc: scheme: read "attrs" handle |
| kill/tkill | proc: scheme: ForceKill / Interrupt ContextVerb |
| CPU affinity | proc: scheme: write "sched-affinity" handle |
| Priority | proc: scheme: write "attrs" prio field |
| Signal setup | proc: scheme: write "sighandler" + shared Sigcontrol pages |
| TLS base (FSBASE) | proc: scheme: write "regs/env" EnvRegisters |
Completely missing syscalls (no number, no handler):
clone, fork, vfork, waitpid, wait4, kill, tkill, tgkill, arch_prctl,
set_thread_area, set_tid_address, set_robust_list, get_robust_list,
sched_setaffinity, sched_getaffinity, sched_setscheduler, sched_getparam,
sigaction, sigprocmask, sigpending, sigsuspend, sigtimedwait,
timer_create, timer_settime, timer_delete, timerfd_create,
getrusage, setrlimit, getrlimit, times
2.5 relibc Pthread Layer
Files: src/pthread/mod.rs, src/sync/*.rs, src/header/pthread/*.rs,
src/header/sched/mod.rs, src/ld_so/tcb.rs, src/platform/redox/mod.rs
Verdict: Core pthreads solid, scheduling/robust/PI absent, several POSIX gaps.
Fully Working (futex-backed)
| API Group | Backend | Notes |
|---|---|---|
| pthread_create/join/detach/exit | redox_rt clone + Waitval | Stack via mmap, TLS via Tcb::new() |
| pthread_cancel/setcancelstate/testcancel | SIGRT_RLCT_CANCEL (33) | Deferred cancellation only |
| pthread_mutex_* (normal/recursive/errorcheck) | AtomicU32 CAS + futex_wait/wake | 3-state: unlocked/locked/waiters |
| pthread_cond_* | Two-counter futex design | CLOCK_REALTIME only (monotonic = stub) |
| pthread_rwlock_* | AtomicU32 + futex | Reader count + WAITING_WR bit |
| pthread_barrier_* | Mutex + Cond | gen_id wrapping counter |
| pthread_spin_* | AtomicI32 CAS | No futex, pure spinning |
| pthread_once | 3-state futex (UNINIT→INITING→INIT) | |
| pthread_key_create/getspecific/setspecific/delete | BTreeMap global + thread_local values | Destructor iteration per POSIX |
| pthread_sigmask | Delegates to sigprocmask | |
| pthread_kill | redox_rt::rlct_kill | |
| pthread_atfork | Thread-local LinkedList hooks | |
| ELF TLS (__thread / #[thread_local]) | PT_TLS + Tcb | Static + dynamic DTV for dlopen |
| pthread_attr_* (getters/setters) | RlctAttr struct | |
Stubs / No-ops / Missing
| API | Status | Root Cause |
|---|---|---|
| sched_get_priority_max/min | todo!() | Kernel has no scheduling policy API |
| sched_getparam/setparam | todo!() | Same |
| sched_setscheduler | todo!() | Same |
| sched_rr_get_interval | todo!() | Same |
| pthread_setschedparam | No-op (returns Ok) | Kernel ignores policy |
| pthread_setschedprio | No-op (returns Ok) | Kernel ignores priority change |
| pthread_getschedparam | todo!() | |
| pthread_getcpuclockid | ENOENT | No per-thread CPU clock |
| pthread_mutex_consistent | todo_skip! | Robust mutex not implemented |
| pthread_mutex_getprioceiling | todo_skip! | Priority ceiling not implemented |
| pthread_mutex_setprioceiling | todo_skip! | Same |
| pthread_mutexattr_setprotocol (PRIO_INHERIT) | Accepted, no-op | PI futex missing |
| pthread_mutexattr_setrobust (ROBUST) | Accepted, no-op | Robust futex missing |
| pthread_cond_init CLOCK_MONOTONIC | todo_skip! | |
| pthread_cond_signal | Calls broadcast (wakes ALL) | Missing FUTEX_REQUEUE optimization |
| pthread_setaffinity_np | Not defined | |
| pthread_getaffinity_np | Not defined | |
| pthread_setname_np | Not defined | |
| pthread_getname_np | Not defined | |
| pthread_setcanceltype | Always returns DEFERRED | ASYNC not tracked |
| Guard pages | Attribute stored, not mapped | No PROT_NONE page before stack |
| PTHREAD_KEYS_MAX limit | Not checked | |
3. Gap Classification
3.1 Correctness Gaps (Must Fix — Silent Data Corruption or Deadlock)
| # | Gap | Impact | Fix Location |
|---|---|---|---|
| C1 | No robust mutexes | Thread death while holding mutex → permanent deadlock for all waiters | Kernel: robust futex list + relibc: pthread_mutex_consistent |
| C2 | No PI futexes | Priority inversion: low-prio thread blocks high-prio thread indefinitely | Kernel: FUTEX_LOCK_PI/UNLOCK_PI + relibc: mutexattr_setprotocol |
| C3 | pthread_cond_signal wakes ALL | Correctness: wastes CPU. Performance: thundering herd on every signal | relibc: use true wake(1); may need FUTEX_REQUEUE |
| C4 | fork() not thread-safe | pthread_atfork hooks exist but child inherits locked mutexes | relibc: implement atfork child handlers properly |
3.2 Performance Gaps (Must Fix for Desktop Responsiveness)
| # | Gap | Impact | Fix Location |
|---|---|---|---|
| P1 | No SMP load balancing | Cores sit idle while others are overloaded | Kernel: work stealing + initial placement |
| P2 | No futex sharding | Single global L1 mutex for ALL futex ops on ALL CPUs | Kernel: 64-shard hash table |
| P3 | No FUTEX_REQUEUE | pthread_cond_broadcast wakes all → thundering herd | Kernel: REQUEUE + CMP_REQUEUE |
| P4 | Full TLB flush on every shootdown | Per-page mprotect/munmap flushes entire TLB on all cores | Kernel: INVLPG-based selective flush |
| P5 | Global context switch lock | Serialization bottleneck beyond ~8 cores | Kernel: per-CPU context switch (needs per-CPU run queues) |
| P6 | All IRQs to BSP | CPU 0 handles all interrupts, cache thrash, latency | Kernel: IRQ steering in I/O APIC + MSI/MSI-X dest field |
| P7 | No RT scheduling | Audio/compositor threads can't get priority | Kernel: SchedPolicy + RT dispatch + relibc: sched_setscheduler |
3.3 POSIX Completeness Gaps (Must Fix for Application Compatibility)
| # | Gap | Impact | Fix Location |
|---|---|---|---|
| X1 | sched_* all todo!() | Applications calling sched_setscheduler panic | relibc: implement via proc scheme |
| X2 | pthread_setschedparam no-op | Apps can't change thread priority | relibc: wire to proc scheme prio write |
| X3 | pthread_setaffinity_np missing | Apps can't pin threads to CPUs | relibc: implement via proc scheme affinity write |
| X4 | pthread_setname_np missing | Debugging harder (no thread names in /proc) | relibc: implement via proc scheme name write |
| X5 | pthread_getcpuclockid ENOENT | Per-thread profiling impossible | relibc + kernel: expose cpu_time via clock |
| X6 | Guard pages not mapped | Stack overflow → silent corruption, no SIGSEGV | relibc: mmap PROT_NONE guard page in pthread_create |
| X7 | pthread_cond_init monotonic stub | CLOCK_MONOTONIC condvars use REALTIME (affected by wall clock jumps) | relibc: implement monotonic condvar |
4. Implementation Plan
Phase 0: Patch Recovery — Re-Apply Lost Threading Work (Week 1–2)
Goal: Recover the P5–P9 work that was lost during the local fork migration.
This is the highest-priority phase — it restores ~6 months of work with minimal new code.
0.1 — Re-apply kernel scheduler patches to local fork
Apply in dependency order to local/sources/kernel/:
| Order | Patch | Status | Action |
|---|---|---|---|
| 1 | P6-futex-sharding | ✅ applies | Commit directly |
| 2 | P6-percpu-runqueues | ✅ applies | Commit directly |
| 3 | P8-percpu-sched | ✅ applies | Commit directly |
| 4 | P8-percpu-wiring | ✅ applies | Commit directly |
| 5 | P8-initial-placement | ✅ applies | Commit directly |
| 6 | P5-sched-rt-policy | ✅ applies | Commit directly |
| 7 | P5-context-mod-sched | ✅ applies | Commit directly |
| 8 | P6-vruntime-switch | ✅ applies | Commit directly |
| 9 | P7-cache-affine-switch | ✅ applies | Commit directly |
| 10 | P9-numa-topology | ✅ applies | Commit directly |
| 11 | P9-proc-lock-ordering | ✅ applies | Commit directly |
| 12 | P8-work-stealing | ❌ needs rebase | Rebase against 1–11, then apply |
| 13 | P8-futex-requeue | ❌ needs rebase | Rebase against P6-sharding (#1), then apply |
| 14 | P8-futex-pi | ❌ needs rebase | Rebase against #13, then apply |
| 15 | P8-futex-robust | ❌ needs rebase | Rebase against #14, then apply |
| 16 | P9-futex-pi-cas-fix | ❌ needs rebase | Rebase against #14, then apply |
| 17 | P7-scheduler-improvements | ❌ needs rebase | Rebase against 1–11, then apply |
Verification after each patch:
cd local/sources/kernel
cargo check # must pass
0.2 — Re-apply relibc threading patches to local fork
Apply to local/sources/relibc/:
| Patch | Action |
|---|---|
| P3-threads.patch | ✅ applies — commit |
| P3-barrier-smp-futex (from absorbed/) | Verify already in fork; if not, apply |
| P3-pthread-signal-races (from absorbed/) | Verify already in fork |
| P3-pthread-yield (from absorbed/) | Verify already in fork |
| P5-robust-mutexes (from absorbed/) | Verify; re-apply if missing |
| P5-robust-mutex-enotrec-fix (from absorbed/) | Same |
| P5-sched-api (from absorbed/) | Same |
| P7-pthread-affinity (from absorbed/) | Same |
| P7-pthread-setname (from absorbed/) | Same |
| P7-setpriority (from absorbed/) | Same |
| P9-spin-and-barrier (from absorbed/) | Same |
| P9-spin-fix (from absorbed/) | Same |
| P3-semaphore-comprehensive | ✅ applies |
Verification:
cd local/sources/relibc
make all # must pass
touch relibc && make prefix # rebuild prefix with new libc
0.3 — Build and smoke test
export REDBEAR_ALLOW_PROTECTED_FETCH=1
./local/scripts/build-redbear.sh --upstream redbear-mini
make qemu # verify boot + basic operation
Success criteria: redbear-mini boots, multi-threaded daemons (pcid, xhcid) start, no kernel panic.
Phase 1: Futex Completeness (Week 2–4)
Goal: Close the futex operation gaps that affect correctness and performance.
Depends on: Phase 0 complete (sharding applied first).
1.1 — FUTEX_REQUEUE + FUTEX_CMP_REQUEUE
Kernel: src/syscall/futex.rs
- Add FUTEX_REQUEUE and FUTEX_CMP_REQUEUE to the futex dispatcher
- Implement: move up to val waiters from addr1 → addr2, optionally compare *addr1 == val2
- Requires locking TWO shards (acquire both in deterministic order to avoid deadlock)
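The two-shard locking rule can be sketched as follows. This is a simplified model in which each shard holds one waiter queue; `requeue` and its signature are hypothetical, and a real implementation would also handle the case where both addresses hash to the same shard under a single lock.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

type Queue = VecDeque<u32>;

/// Move up to `max` waiters from `shards[from]` to `shards[to]`.
/// Returns the number of waiters moved. Panics on out-of-range indices.
fn requeue(shards: &[Mutex<Queue>], from: usize, to: usize, max: usize) -> usize {
    if from == to {
        // Same shard: nothing to move in this simplified model.
        return 0;
    }
    // Deadlock avoidance: every caller locks the lower index first, so two
    // concurrent requeues in opposite directions cannot wait on each other.
    let (lo, hi) = if from < to { (from, to) } else { (to, from) };
    let first = shards[lo].lock().unwrap();
    let second = shards[hi].lock().unwrap();
    let (mut src, mut dst) = if from < to { (first, second) } else { (second, first) };

    let n = max.min(src.len());
    let moved: Vec<u32> = src.drain(..n).collect();
    dst.extend(moved);
    n
}
```

Ordering by shard index is one of several valid total orders; ordering by lock address would work equally well as long as every path uses the same rule.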
relibc: src/sync/cond.rs
- Change pthread_cond_broadcast to use FUTEX_REQUEUE (move waiters from condvar futex to mutex futex)
- Change pthread_cond_signal to wake exactly 1 (not all)
Impact: Eliminates thundering herd on every pthread_cond_broadcast. Major win for Qt event loop, KWin compositor, Mesa worker threads.
1.2 — PI Futexes (FUTEX_LOCK_PI / FUTEX_UNLOCK_PI / FUTEX_TRYLOCK_PI)
Kernel: src/syscall/futex.rs
- Add PiState tracking per futex: owner context + waiter list with priorities
- On LOCK_PI block: boost owner's priority to waiter's priority
- On UNLOCK_PI: restore original priority, wake highest-priority waiter
- Requires kernel RT scheduling (Phase 0.1 #6–7: P5-sched-rt-policy)
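A minimal sketch of the boost and restore bookkeeping described above. The fields and methods of this `PiState` are hypothetical (the plan names only the type), and a larger number is assumed to mean a higher priority.

```rust
/// Hypothetical per-futex PI state. Larger number = higher priority.
struct PiState {
    owner_base_prio: u8,
    owner_effective_prio: u8,
    /// (thread id, priority) in arrival order.
    waiters: Vec<(u32, u8)>,
}

impl PiState {
    fn new(owner_prio: u8) -> Self {
        PiState {
            owner_base_prio: owner_prio,
            owner_effective_prio: owner_prio,
            waiters: Vec::new(),
        }
    }

    /// LOCK_PI found the lock held: queue the waiter and boost the owner.
    fn block(&mut self, tid: u32, prio: u8) {
        self.waiters.push((tid, prio));
        self.owner_effective_prio = self.owner_effective_prio.max(prio);
    }

    /// UNLOCK_PI: returns the priority the releasing owner drops back to and
    /// the next owner (highest-priority waiter, earliest arrival on ties).
    fn unlock(&mut self) -> (u8, Option<u32>) {
        let restored = self.owner_base_prio;
        let mut best: Option<usize> = None;
        for (i, &(_, prio)) in self.waiters.iter().enumerate() {
            if best.map_or(true, |b| prio > self.waiters[b].1) {
                best = Some(i);
            }
        }
        let next = best.map(|i| self.waiters.remove(i));
        if let Some((_, prio)) = next {
            // The new owner inherits from whoever is still waiting.
            self.owner_base_prio = prio;
            self.owner_effective_prio =
                self.waiters.iter().map(|w| w.1).fold(prio, |a, b| a.max(b));
        }
        (restored, next.map(|(tid, _)| tid))
    }
}
```

The sketch covers a single lock only; chained inheritance across several PI locks needs the boost to propagate to the owner's own blocker, which is omitted here.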
relibc: src/sync/pthread_mutex.rs
- Implement PTHREAD_PRIO_INHERIT protocol path using PI futex
- Replace todo_skip! in pthread_mutex_consistent with real implementation
1.3 — Robust Futex List
Kernel: src/syscall/futex.rs + src/context/context.rs
- Add robust_list_head: Option<usize> to Context struct
- Implement set_robust_list / get_robust_list via proc scheme or syscall
- On thread exit (exit_this_context): walk robust list, set FUTEX_OWNER_DIED bit, wake one waiter with EOWNERDEAD
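The exit-time walk might look roughly like this. The sketch assumes the Linux futex word layout (TID in the low 30 bits, `FUTEX_WAITERS` and `FUTEX_OWNER_DIED` in the top two); whether Red Bear adopts the same layout is not stated in this plan. `handle_exit` and the slice-based list are hypothetical simplifications of the user-space linked list.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Linux futex word layout, assumed here for illustration.
const FUTEX_WAITERS: u32 = 0x8000_0000;
const FUTEX_OWNER_DIED: u32 = 0x4000_0000;
const FUTEX_TID_MASK: u32 = 0x3fff_ffff;

/// Walk the exiting thread's robust list. Every lock still owned by `tid`
/// is marked OWNER_DIED; the return value is the number of locks that had
/// waiters and therefore need one waiter woken.
fn handle_exit(tid: u32, robust_list: &[&AtomicU32]) -> usize {
    let mut to_wake = 0;
    for word in robust_list {
        let mut cur = word.load(Ordering::Acquire);
        loop {
            if cur & FUTEX_TID_MASK != tid {
                break; // not owned by the dying thread: leave untouched
            }
            // Clear the owner TID, keep the WAITERS bit, set OWNER_DIED.
            let new = (cur & FUTEX_WAITERS) | FUTEX_OWNER_DIED;
            match word.compare_exchange(cur, new, Ordering::AcqRel, Ordering::Acquire) {
                Ok(_) => {
                    if cur & FUTEX_WAITERS != 0 {
                        to_wake += 1;
                    }
                    break;
                }
                Err(seen) => cur = seen, // raced with userspace: retry
            }
        }
    }
    to_wake
}
```

The next thread to acquire such a lock sees `FUTEX_OWNER_DIED` and reports `EOWNERDEAD` to the caller, which is where pthread_mutex_consistent comes in.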
relibc: src/sync/pthread_mutex.rs
- Implement robust list registration in pthread_mutex_lock
- Implement pthread_mutex_consistent: clear EOWNERDEAD state
- Replace todo_skip! with real implementation
1.4 — FUTEX_WAKE_OP
Kernel: src/syscall/futex.rs
- Implement atomic op + wake: perform op on addr2, then wake up to val waiters on addr1
- Operations: set, add, or, andn, xor, with comparison condition
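As an illustration of the operation and comparison decoding, the sketch below follows the Linux `FUTEX_OP` bit layout (op in bits 28..31, cmp in 24..27, oparg in 12..23, cmparg in 0..11). It is deliberately simplified: the oparg-shift flag is rejected, and arguments are treated as unsigned rather than sign-extended. `encode_wake_op` and `apply_wake_op` are hypothetical helper names.

```rust
/// Pack an operation the way Linux's FUTEX_OP macro does.
fn encode_wake_op(op: u32, oparg: u32, cmp: u32, cmparg: u32) -> u32 {
    ((op & 0xf) << 28) | ((cmp & 0xf) << 24) | ((oparg & 0xfff) << 12) | (cmparg & 0xfff)
}

/// Given the old value at addr2, return the value to store and whether the
/// comparison on the OLD value holds. Returns None for unknown encodings.
fn apply_wake_op(encoded: u32, old: u32) -> Option<(u32, bool)> {
    let op = (encoded >> 28) & 0xf;
    let cmp = (encoded >> 24) & 0xf;
    let oparg = (encoded >> 12) & 0xfff;
    let cmparg = encoded & 0xfff;

    let new = match op {
        0 => oparg,                   // SET
        1 => old.wrapping_add(oparg), // ADD
        2 => old | oparg,             // OR
        3 => old & !oparg,            // ANDN
        4 => old ^ oparg,             // XOR
        _ => return None,             // includes the unsupported shift flag
    };
    let cmp_true = match cmp {
        0 => old == cmparg, // EQ
        1 => old != cmparg, // NE
        2 => old < cmparg,  // LT
        3 => old <= cmparg, // LE
        4 => old > cmparg,  // GT
        5 => old >= cmparg, // GE
        _ => return None,
    };
    Some((new, cmp_true))
}
```

In the Linux semantics the comparison result gates an additional wake on addr2; the kernel side must perform the read-modify-write on the futex word atomically, which this pure function does not model.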
Impact: glibc mutex fast path optimization. Not critical for relibc but helps ported glibc-linked binaries.
Phase 2: SMP Scheduling Quality (Week 3–6)
Goal: Make multi-core actually distribute work.
Depends on: Phase 0 complete (per-CPU queues applied).
2.1 — Work stealing (recover + fix)
Kernel: src/context/switch.rs
- On select_next_context() empty local queue: steal from victim CPU
- Pick victim by round-robin, steal highest-priority runnable context
- Limit steal batch size (1–2 contexts per steal attempt)
- Send IpiKind::Wakeup to target CPU if stealing woke it from idle
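The victim selection above, reduced to a single-threaded sketch with a steal batch of one. `Stealer` and `RunQueue` are hypothetical; the real per-CPU queues are lock-protected and the kernel's steal_work() may differ. A larger priority value is assumed to run first.

```rust
/// Per-CPU run queue of (priority, task id); larger priority runs first.
type RunQueue = Vec<(u8, u32)>;

/// Remembers where the last scan stopped so victims rotate round-robin.
struct Stealer {
    next_victim: usize,
}

impl Stealer {
    /// Called by `thief` when its own queue is empty. Scans the other CPUs
    /// starting at `next_victim` and steals the single highest-priority task
    /// from the first non-empty queue. Returns (victim CPU, task id).
    fn steal(&mut self, thief: usize, queues: &mut [RunQueue]) -> Option<(usize, u32)> {
        let n = queues.len();
        for step in 0..n {
            let victim = (self.next_victim + step) % n;
            if victim == thief || queues[victim].is_empty() {
                continue;
            }
            let best = (0..queues[victim].len()).max_by_key(|&i| queues[victim][i].0)?;
            let (_, tid) = queues[victim].remove(best);
            self.next_victim = (victim + 1) % n;
            return Some((victim, tid));
        }
        None
    }
}
```

Starting each scan after the previous victim keeps one busy CPU from being drained repeatedly while others with work are ignored.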
Recovery: P8-work-stealing needs rebase against per-CPU wiring.
2.2 — Load balancing (recover + verify)
Kernel: src/context/switch.rs
- Periodic balance trigger (every N ticks or when queue depth difference > threshold)
- Migrate contexts from overloaded CPU to most-idle CPU
- Respect sched_affinity mask during migration
Recovery: P8-load-balance is in absorbed/ — verify it's in the fork after Phase 0.
2.3 — Reschedule IPI
Kernel: src/arch/x86_shared/ipi.rs + src/context/switch.rs
- When waking a context on a different CPU, send IpiKind::Switch to that CPU
- Currently the Switch IPI exists but is not used by the scheduler
2.4 — Per-page TLB flush (INVLPG)
Kernel: rmm/src/arch/x86_64.rs + src/context/memory.rs
- Add invalidate_page(addr) using invlpg instruction
- Modify Flusher to track individual pages and use INVLPG when ≤ N pages affected
- Fall back to CR3 reload only for large-scale invalidations
Impact: Every mprotect/mmap/munmap on a multi-threaded process currently flushes the ENTIRE TLB on every core. This is one of the most impactful single fixes.
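The page-tracking policy can be sketched independently of the hardware. The type below only decides what to flush; issuing `invlpg` or reloading CR3 is left to the caller. This `Flusher`, `FlushPlan`, and `max_pages` are illustrative names and do not reflect the existing rmm Flusher API.

```rust
const PAGE_SIZE: usize = 4096;

#[derive(Debug, PartialEq)]
enum FlushPlan {
    Nothing,
    /// Invalidate exactly these page-aligned addresses (one INVLPG each).
    Pages(Vec<usize>),
    /// Too many pages touched: fall back to a full flush (CR3 reload).
    Full,
}

/// Hypothetical batching flusher with a per-batch page limit.
struct Flusher {
    pages: Vec<usize>,
    overflowed: bool,
    max_pages: usize,
}

impl Flusher {
    fn new(max_pages: usize) -> Self {
        Flusher { pages: Vec::new(), overflowed: false, max_pages }
    }

    /// Record that the mapping containing `addr` changed.
    fn queue(&mut self, addr: usize) {
        if self.overflowed {
            return;
        }
        let page = addr & !(PAGE_SIZE - 1);
        if self.pages.contains(&page) {
            return;
        }
        if self.pages.len() == self.max_pages {
            // Past the threshold a full flush is cheaper than N INVLPGs.
            self.overflowed = true;
            self.pages.clear();
        } else {
            self.pages.push(page);
        }
    }

    fn finish(self) -> FlushPlan {
        if self.overflowed {
            FlushPlan::Full
        } else if self.pages.is_empty() {
            FlushPlan::Nothing
        } else {
            FlushPlan::Pages(self.pages)
        }
    }
}
```

The right threshold is workload and microarchitecture dependent and would need measuring; the value is left as a parameter for that reason.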
2.5 — TLB broadcast optimization
Kernel: src/percpu.rs
- Replace per-CPU sequential shootdown_tlb_ipi(Some(id)) loop with ICR "all excluding self" (destination shorthand 0b11)
- Single IPI + global ack counter instead of N individual IPIs + N ack counters
Phase 3: RT Scheduling (Week 4–6)
Goal: Allow applications to request real-time scheduling for latency-sensitive threads.
Depends on: Phase 0 (SchedPolicy applied) + Phase 2 (per-CPU queues).
3.1 — Kernel RT scheduling dispatch
Kernel: src/context/switch.rs (from P5-sched-rt-policy — recovered in Phase 0)
- select_next_context() passes:
  - SCHED_FIFO contexts (highest RT priority first, no preemption within same prio)
  - SCHED_RR contexts (highest RT priority first, round-robin within same prio)
  - SCHED_OTHER contexts (existing DWRR/vruntime)
- SCHED_RR quantum: configurable per-context (default 100ms)
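The pass order above can be written as a small selection function. This is a sketch under stated assumptions, not the recovered P5-sched-rt-policy code: `Task`, `Policy`, and `select_next` are hypothetical, SCHED_OTHER is reduced to a minimum-vruntime pick, and round-robin rotation within a priority is assumed to be handled by the caller reordering the slice.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Policy {
    Fifo,
    RoundRobin,
    Other,
}

#[derive(Clone, Copy, Debug)]
struct Task {
    tid: u32,
    policy: Policy,
    /// Only meaningful for Fifo/RoundRobin; larger = more urgent.
    rt_prio: u8,
    /// Only meaningful for Other; smaller = has run less.
    vruntime: u64,
}

/// Returns the index in `runnable` of the task to run next.
fn select_next(runnable: &[Task]) -> Option<usize> {
    // Passes 1 and 2: FIFO first, then RR. Within a pass the highest
    // rt_prio wins and the earliest queue position breaks ties.
    for policy in [Policy::Fifo, Policy::RoundRobin] {
        let mut best: Option<usize> = None;
        for (i, t) in runnable.iter().enumerate() {
            if t.policy == policy && best.map_or(true, |b| t.rt_prio > runnable[b].rt_prio) {
                best = Some(i);
            }
        }
        if best.is_some() {
            return best;
        }
    }
    // Pass 3: normal tasks, smallest vruntime first.
    runnable
        .iter()
        .enumerate()
        .filter(|(_, t)| t.policy == Policy::Other)
        .min_by_key(|(_, t)| t.vruntime)
        .map(|(i, _)| i)
}
```

Because any runnable RT task shadows every SCHED_OTHER task, a runaway SCHED_FIFO thread can starve the rest of the system; some form of RT throttling is worth considering alongside this dispatch order.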
3.2 — relibc sched_* API completion
relibc: src/header/sched/mod.rs
Replace ALL todo!() stubs:
| Function | Implementation |
|---|---|
| sched_getscheduler(pid) | Read policy from proc scheme attrs |
| sched_setscheduler(pid, policy, param) | Write policy + RT priority via proc scheme |
| sched_getparam(pid, param) | Read RT priority from proc scheme |
| sched_setparam(pid, param) | Write RT priority via proc scheme |
| sched_get_priority_max(policy) | Return 99 for FIFO/RR, 0 for OTHER |
| sched_get_priority_min(policy) | Return 1 for FIFO/RR, 0 for OTHER |
| sched_rr_get_interval(pid, tp) | Return SCHED_RR quantum (100ms default) |
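The three functions that need no kernel round trip are simple enough to sketch directly from the table. The policy constants use the Linux numbering (0/1/2) and `EINVAL` is 22; both are assumptions about what relibc defines. The `Result`-returning signatures are a simplification of the C ABI.

```rust
// Assumed numbering (matches Linux); relibc's constants may differ.
const SCHED_OTHER: i32 = 0;
const SCHED_FIFO: i32 = 1;
const SCHED_RR: i32 = 2;
const EINVAL: i32 = 22;

/// Highest valid priority for a policy, per the table above.
fn sched_get_priority_max(policy: i32) -> Result<i32, i32> {
    match policy {
        SCHED_FIFO | SCHED_RR => Ok(99),
        SCHED_OTHER => Ok(0),
        _ => Err(EINVAL),
    }
}

/// Lowest valid priority for a policy, per the table above.
fn sched_get_priority_min(policy: i32) -> Result<i32, i32> {
    match policy {
        SCHED_FIFO | SCHED_RR => Ok(1),
        SCHED_OTHER => Ok(0),
        _ => Err(EINVAL),
    }
}

/// Default SCHED_RR quantum as (seconds, nanoseconds): 100 ms.
fn default_rr_interval() -> (i64, i64) {
    (0, 100_000_000)
}
```

A C-ABI wrapper would map `Err(e)` to a return of -1 with errno set to `e`.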
3.3 — pthread_setschedparam wiring
relibc: src/pthread/mod.rs
- Replace set_sched_param no-op with real proc scheme call
- Replace set_sched_priority no-op with real proc scheme call
Phase 4: POSIX Pthread Completeness (Week 5–8)
Goal: Close remaining POSIX gaps that block application compatibility.
Depends on: Phase 0 + Phase 3 (for sched API).
4.1 — pthread_setaffinity_np / pthread_getaffinity_np
relibc: src/header/pthread/mod.rs + src/header/sched/mod.rs
- Implement using proc scheme "sched-affinity" write/read
- Define cpu_set_t type and CPU_SET/CPU_CLR/CPU_ZERO/CPU_ISSET macros
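On the Rust side, the macro family maps naturally onto a fixed-size bitmask. The sketch below assumes a 1024-bit set (glibc's `CPU_SETSIZE`); the size Red Bear chooses and the wire format of the "sched-affinity" handle are not specified here, and `CpuSet` is a hypothetical name.

```rust
/// Assumed capacity; glibc uses 1024.
const CPU_SETSIZE: usize = 1024;

#[derive(Clone, Copy)]
struct CpuSet {
    bits: [u64; CPU_SETSIZE / 64],
}

impl CpuSet {
    /// CPU_ZERO: an empty set.
    fn zero() -> Self {
        CpuSet { bits: [0; CPU_SETSIZE / 64] }
    }

    /// CPU_SET: out-of-range CPUs are ignored.
    fn set(&mut self, cpu: usize) {
        if cpu < CPU_SETSIZE {
            self.bits[cpu / 64] |= 1u64 << (cpu % 64);
        }
    }

    /// CPU_CLR: out-of-range CPUs are ignored.
    fn clr(&mut self, cpu: usize) {
        if cpu < CPU_SETSIZE {
            self.bits[cpu / 64] &= !(1u64 << (cpu % 64));
        }
    }

    /// CPU_ISSET: out-of-range CPUs are reported as not set.
    fn is_set(&self, cpu: usize) -> bool {
        cpu < CPU_SETSIZE && self.bits[cpu / 64] & (1u64 << (cpu % 64)) != 0
    }

    /// CPU_COUNT: number of CPUs in the set.
    fn count(&self) -> usize {
        self.bits.iter().map(|w| w.count_ones() as usize).sum()
    }
}
```

The C header would expose the same operations as macros over a `cpu_set_t` with an identical layout.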
4.2 — pthread_setname_np / pthread_getname_np
relibc: src/header/pthread/mod.rs
- Implement using proc scheme name write/read (kernel already supports 32-char name field)
4.3 — pthread_cond_init CLOCK_MONOTONIC
relibc: src/sync/cond.rs
- Replace todo_skip! with real monotonic clock support
- Store clock choice in cond struct, use CLOCK_MONOTONIC for deadline calculations
4.4 — Guard pages
relibc: src/pthread/mod.rs
- In pthread_create, when allocating stack via mmap:
  - Map [stack_base, stack_base + guard_size) with PROT_NONE
  - Map [stack_base + guard_size, stack_base + guard_size + stack_size) with PROT_READ | PROT_WRITE
- On thread exit, munmap both regions
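The address arithmetic for the two regions, as a sketch. It computes the ranges only and performs no mapping; `StackLayout` and `stack_layout` are hypothetical names, a 4 KiB page size is assumed, and both sizes are rounded up to whole pages.

```rust
const PAGE_SIZE: usize = 4096;

#[derive(Debug, PartialEq)]
struct StackLayout {
    /// [start, end) to map PROT_NONE.
    guard: (usize, usize),
    /// [start, end) to map PROT_READ | PROT_WRITE.
    stack: (usize, usize),
    /// Bytes to reserve (and later munmap) in total.
    total: usize,
}

/// Split one reservation starting at `base` into a low guard region and the
/// usable stack above it. Returns None on arithmetic overflow.
fn stack_layout(base: usize, stack_size: usize, guard_size: usize) -> Option<StackLayout> {
    let round_up = |n: usize| n.checked_add(PAGE_SIZE - 1).map(|v| v & !(PAGE_SIZE - 1));
    let guard = round_up(guard_size)?;
    let stack = round_up(stack_size)?;
    let guard_end = base.checked_add(guard)?;
    let stack_end = guard_end.checked_add(stack)?;
    Some(StackLayout {
        guard: (base, guard_end),
        stack: (guard_end, stack_end),
        total: guard + stack,
    })
}
```

The guard sits at the low end because x86 stacks grow downward, so an overflow runs into the PROT_NONE region and faults instead of overwriting a neighbouring mapping.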
4.5 — pthread_getcpuclockid
relibc: src/header/pthread/mod.rs
- Return CLOCK_THREAD_CPUTIME_ID (requires kernel support: add the clock to clock_gettime)
Kernel: src/syscall/time.rs
- Add CLOCK_THREAD_CPUTIME_ID → read context.cpu_time
4.6 — PTHREAD_KEYS_MAX enforcement
relibc: src/header/pthread/tls.rs
- Check NEXTKEY against PTHREAD_KEYS_MAX (1024) before allocating
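One race-free way to enforce the limit is a compare-and-swap loop on the counter, sketched below. It assumes the key counter is (or can be made) an `AtomicUsize`; the actual type of `NEXTKEY` in relibc is not shown in this plan, and `alloc_key` is a hypothetical helper. `EAGAIN` is the error POSIX specifies for pthread_key_create when the limit is exceeded; 11 is its Linux value.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const PTHREAD_KEYS_MAX: usize = 1024;
const EAGAIN: i32 = 11;

/// Reserve the next key index, or fail with EAGAIN once the limit is hit.
/// The counter is never advanced past PTHREAD_KEYS_MAX.
fn alloc_key(next_key: &AtomicUsize) -> Result<usize, i32> {
    next_key
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |k| {
            if k < PTHREAD_KEYS_MAX { Some(k + 1) } else { None }
        })
        .map_err(|_| EAGAIN)
}
```

A plain `fetch_add` followed by a bounds check would also detect exhaustion, but it keeps incrementing on every failed call; `fetch_update` leaves the counter untouched on failure.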
Phase 5: IRQ Steering and NUMA (Week 8–12)
Goal: Distribute interrupt load and respect memory locality.
Depends on: Phase 2 (per-CPU infrastructure).
5.1 — IRQ steering
Kernel: src/arch/x86_shared/device/ioapic.rs + src/arch/x86_shared/idt.rs
- Change I/O APIC redirection dest from bsp_apic_id to round-robin or RSS hash
- Add per-CPU legacy IRQ handlers in IDT (not just BSP)
- For MSI/MSI-X: set destination CPU in Message Address register
5.2 — NUMA topology discovery
Kernel: src/acpi/ (from P9-numa-topology — recovered in Phase 0)
- Parse SRAT (Static Resource Affinity Table) for proximity domains
- Parse SLIT (System Locality Distance Information Table) for inter-node distances
- Store NumaTopology in kernel for O(1) scheduling lookups
5.3 — NUMA-aware memory allocation
Kernel: src/memory/ + frame allocator
- Track frame NUMA node in Frame or PageInfo
- On allocation, prefer frames from requesting CPU's NUMA node
- Fallback to remote node when local node is exhausted
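The local-first, nearest-fallback policy can be sketched with per-node free lists and a SLIT-style distance matrix. Everything here is illustrative: `alloc_frame`, the `Vec`-based pools, and the matrix representation are assumptions, not the kernel's frame allocator.

```rust
/// Pick a frame for a CPU on node `local`.
/// `pools[n]` is the free-frame list of node n; `distance[a][b]` is the
/// SLIT-style distance from node a to node b (10 = local by convention).
/// Returns (node the frame came from, frame address).
/// Panics if `local` or the matrix dimensions do not match `pools`.
fn alloc_frame(
    pools: &mut [Vec<usize>],
    distance: &[Vec<u8>],
    local: usize,
) -> Option<(usize, usize)> {
    // Visit nodes from nearest to farthest; the stable sort keeps the
    // lower-numbered node first when two nodes are equally distant.
    let mut order: Vec<usize> = (0..pools.len()).collect();
    order.sort_by_key(|&node| distance[local][node]);
    for node in order {
        if let Some(frame) = pools[node].pop() {
            return Some((node, frame));
        }
    }
    None
}
```

Sorting on every allocation is only acceptable in a sketch; a real allocator would precompute the per-node fallback order once, after parsing the SLIT.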
5. Dependency Chain
Phase 0 (Patch Recovery) ← BLOCKING FOR ALL OTHERS
│
├──► Phase 1 (Futex Completeness)
│ │
│ ├──► 1.1 REQUEUE ──► condvar performance
│ ├──► 1.2 PI ──► priority inversion fix (needs Phase 3.1)
│ ├──► 1.3 Robust ──► deadlock prevention
│ └──► 1.4 WAKE_OP ──► glibc compat
│
├──► Phase 2 (SMP Scheduling)
│ │
│ ├──► 2.1 Work stealing ──► core utilization
│ ├──► 2.2 Load balancing ──► fair distribution
│ ├──► 2.3 Reschedule IPI ──► cross-CPU wakeup
│ ├──► 2.4 Per-page TLB ──► mmap/mprotect performance
│ └──► 2.5 TLB broadcast ──► IPI efficiency
│
├──► Phase 3 (RT Scheduling)
│ │
│ ├──► 3.1 Kernel RT dispatch (from Phase 0)
│ ├──► 3.2 relibc sched_* API ──► POSIX compat
│ └──► 3.3 pthread_setschedparam ──► app priority control
│
├──► Phase 4 (POSIX Pthread Completeness)
│ │
│ ├──► 4.1 Affinity API ──► CPU pinning
│ ├──► 4.2 Thread naming ──► debuggability
│ ├──► 4.3 Monotonic condvar ──► clock correctness
│ ├──► 4.4 Guard pages ──► stack overflow detection
│ ├──► 4.5 CPU clock ──► per-thread profiling
│ └──► 4.6 Keys max ──► resource limit
│
└──► Phase 5 (IRQ + NUMA)
│
├──► 5.1 IRQ steering ──► interrupt distribution
├──► 5.2 NUMA topology ──► (from Phase 0)
└──► 5.3 NUMA allocator ──► memory locality
Parallel work possible:
- Phase 1 + Phase 2 + Phase 3 can run in parallel after Phase 0, except that 1.2 (PI futex) needs 3.1 (kernel RT dispatch)
- Phase 4 items are independent of each other
- Phase 5 depends on Phase 2 but not on Phase 1/3/4
6. Validation Plan
6.1 Build Evidence
| Check | Command |
|---|---|
| Kernel compiles | make r.kernel |
| relibc compiles | make r.relibc |
| Prefix rebuilt | touch relibc kernel && make prefix |
| Full OS builds | make all CONFIG_NAME=redbear-mini |
6.2 Runtime Evidence (QEMU)
| Test | Verification |
|---|---|
| Multi-threaded boot | make qemu QEMUFLAGS="-smp 4" — all 4 CPUs active |
| pthread smoke test | Guest: compile + run simple pthread_create/join/mutex test |
| Work stealing | Guest: spawn 8 threads on 4-CPU QEMU, verify all CPUs utilized |
| Futex REQUEUE | Guest: condvar broadcast benchmark — waiters wake in ≤2 batches, not N |
| PI futex | Guest: priority inversion test — high-prio thread unblocked within 1 tick |
| Robust mutex | Guest: kill thread holding mutex, verify EOWNERDEAD recovery |
| RT scheduling | Guest: SCHED_FIFO thread preempts SCHED_OTHER within 100μs |
| CPU affinity | Guest: pin thread to CPU 1, verify it never runs on CPU 0 |
| Thread naming | Guest: cat /scheme/proc/*/name shows set names |
| Guard pages | Guest: overflow stack, verify SIGSEGV (not silent corruption) |
| TLB efficiency | Guest: mprotect benchmark — compare TLB miss rate before/after |
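A guest-side smoke test for the create/join/mutex row could look like the sketch below. It assumes the guest has a Rust toolchain and that `std::thread` and `std::sync::Mutex` are backed by relibc pthreads on the target; `mutex_smoke` is a hypothetical helper name, and a C version using `pthread_create` directly would serve equally well.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Spawns `threads` workers that each add 1 to a shared counter `iters`
/// times, joins them all, and returns the final counter value.
pub fn mutex_smoke(threads: usize, iters: u64) -> u64 {
    let counter = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iters {
                    // Each increment takes and releases the lock.
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    let total = *counter.lock().unwrap();
    total
}
```

The test passes when the result equals `threads * iters`, for example 8000 for 8 threads and 1000 iterations. A lower value would indicate lost updates under contention, and a hang would point at thread creation, join, or futex wake.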
6.3 Validation Scripts (to create)
local/scripts/test-threading-qemu.sh # Comprehensive threading smoke test
local/scripts/test-futex-requeue-qemu.sh # REQUEUE-specific test
local/scripts/test-futex-pi-qemu.sh # PI futex test
local/scripts/test-futex-robust-qemu.sh # Robust mutex test
local/scripts/test-sched-rt-qemu.sh # RT scheduling latency test
local/scripts/test-sched-balance-qemu.sh # Load balancing on multi-vCPU
local/scripts/test-threading-baremetal.sh # Bare metal multi-threaded stress
7. Estimated Effort
| Phase | Duration | New Code | Recovery | Dependencies |
|---|---|---|---|---|
| Phase 0: Patch Recovery | 1–2 weeks | Minimal (rebase 5 patches) | 13 patches apply directly | None |
| Phase 1: Futex Completeness | 2–3 weeks | REQUEUE impl + WAKE_OP | PI/robust from P8 patches | Phase 0 |
| Phase 2: SMP Scheduling | 3–4 weeks | TLB INVLPG + broadcast opt | Work stealing from P8 | Phase 0 |
| Phase 3: RT Scheduling | 1–2 weeks | relibc sched_* API | RT dispatch from P5 | Phase 0 |
| Phase 4: POSIX Pthread | 2–3 weeks | Affinity/naming/guard/clock | Partial from P7 patches | Phase 0, 3 |
| Phase 5: IRQ + NUMA | 3–4 weeks | IRQ steering + NUMA allocator | NUMA topology from P9 | Phase 0, 2 |
Total: 12–18 weeks with 1–2 developers. Phase 0 alone recovers the majority of the value in 1–2 weeks.
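The 12–18 week total is the sum of the Duration column with every phase run back to back. As a cross-check, the sketch below recomputes that sum and also derives the longest dependency chain from the Dependencies column. The chain length is only a lower bound: it assumes enough developers to run every independent phase concurrently, which the 1–2 developer estimate does not guarantee. `PHASES`, `sequential_total`, and `critical_path` are illustrative names.

```rust
/// (min weeks, max weeks, prerequisite phases); index = phase number.
/// Values copied from the Duration and Dependencies columns above.
const PHASES: [(u32, u32, &[usize]); 6] = [
    (1, 2, &[]),
    (2, 3, &[0]),
    (3, 4, &[0]),
    (1, 2, &[0]),
    (2, 3, &[0, 3]),
    (3, 4, &[0, 2]),
];

/// Total effort if every phase runs one after another.
pub fn sequential_total() -> (u32, u32) {
    PHASES.iter().fold((0, 0), |(lo, hi), p| (lo + p.0, hi + p.1))
}

/// Length of the longest dependency chain (all parallelism exploited).
/// Relies on every prerequisite having a lower index than its dependent.
pub fn critical_path() -> (u32, u32) {
    let mut finish = [(0u32, 0u32); 6];
    for (i, &(lo, hi, deps)) in PHASES.iter().enumerate() {
        // A phase starts when its latest prerequisite has finished.
        let start_lo = deps.iter().map(|&d| finish[d].0).max().unwrap_or(0);
        let start_hi = deps.iter().map(|&d| finish[d].1).max().unwrap_or(0);
        finish[i] = (start_lo + lo, start_hi + hi);
    }
    (
        finish.iter().map(|f| f.0).max().unwrap_or(0),
        finish.iter().map(|f| f.1).max().unwrap_or(0),
    )
}
```

`sequential_total()` yields (12, 18), matching the stated total. `critical_path()` yields (7, 10), driven by the Phase 0 → Phase 2 → Phase 5 chain.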
8. Integration with Existing Plans
| Plan | Relationship |
|---|---|
| CONSOLE-TO-KDE-DESKTOP-PLAN.md | Consumer — Phase 3 (KWin) needs PI futex + RT scheduling; Phase 2 (compositor) needs work stealing |
| IRQ-AND-LOWLEVEL-CONTROLLERS-ENHANCEMENT-PLAN.md | Sibling — IRQ steering (Phase 5.1) belongs to both plans |
| DRM-MODERNIZATION-EXECUTION-PLAN.md | Consumer — GPU worker threads benefit from load balancing + affinity |
| IMPLEMENTATION-MASTER-PLAN.md | Parent — this plan covers the kernel threading substrate |
| CPU-DMA-IRQ-MSI-SCHEDULER-FIX-PLAN.md | Sibling — overlaps on scheduler/IRQ delivery |
9. Bottom Line
The Red Bear OS threading stack is functional for basic single-threaded and lightly-threaded workloads. The SMP boot, context switching, TLB shootdown, and basic futex operations are correct.
The critical problem is that 6 months of threading enhancement work (P5–P9 patches) was lost during the local fork migration. This work survives as patch files: per the §7 estimate, 13 apply directly to the current fork and 5 need a rebase, so Phase 0 (Patch Recovery) is the single highest-ROI action.
After Phase 0, the remaining gaps are:
- Futex REQUEUE/PI/robust — for condvar performance and deadlock prevention
- SMP work stealing + load balancing — for multi-core utilization
- RT scheduling — for audio/compositor thread priority
- POSIX pthread completeness — for application compatibility
- IRQ steering + NUMA — for multi-socket performance
The desktop-critical path (KWin responsiveness) requires Phases 0–3. The server-critical path (multi-socket, NUMA) adds Phase 5. Phase 4 (POSIX completeness) benefits all paths but is not desktop-blocking.