vasilito 533a1c2969 docs: update multi-threading plan with Phase 0c status

Prepend an UPDATE block at the top of the plan document recording:

  - The 8 commits that landed Phase 0c (futex sharding, per-CPU run
    queues, vruntime, work stealing, load balancing, cache-affine,
    initial placement, NUMA topology, proc scheme handles, fadt fix)
  - The upstream-redox kernel audit finding (upstream has none of
    these features; local fork is sole implementation)
  - A plan-vs-actual state table showing which claimed 'missing'
    features are now present
  - The kernel-side Phase 0c is complete; remaining work is
    relibc-side (Phase 0e) and futex-REQUEUE/PI/robust (Phase 1)

The detailed §1–§9 analysis is preserved unchanged as historical
record. The status column 'Missing' in §1 should be re-read as
'now present in local kernel fork, pending relibc userspace wiring.'

cargo check now exits 0 with 0 errors in the local kernel fork.

2026-07-02 07:01:16 +03:00

Red Bear OS — Multi-Threading Comprehensive Assessment and Implementation Plan

Date: 2026-07-02 (initial assessment); 2026-07-02 (Phase 0c patch recovery complete)
Scope: Full-stack multi-threading audit: hardware/SMP, kernel scheduler, kernel futex, kernel syscall ABI, relibc pthreads, userspace threading correctness and performance
Status: Authoritative; supersedes archived/KERNEL-SCHEDULER-MULTITHREAD-IMPROVEMENT-PLAN.md and archived/SCHEDULER-REVIEW-FINAL.md for all threading matters
Validation levels: builds → enumerates → usable → validated → hardware-validated

UPDATE — Phase 0c Patch Recovery (2026-07-02)

The assessment in §1 below was accurate as of the time of writing. The Phase 0c patch recovery was then executed in the same session, landing the following commits on the local kernel fork (local/sources/kernel/):

Commit	Effect
`ed3f0e1`	P6-futex-sharding: replaces single global `Mutex<L1, FutexList>` with 64-shard hash table
`5fb42fc`	RUN_QUEUE_COUNT pre-flight: defines `pub const RUN_QUEUE_COUNT: usize = 40;` (was missing from patch chain)
`cbf051e`	P7-cache-affine-context (manual): surgically inserts `SchedPolicy` enum, `SCHED_PRIORITY_LEVELS`, helper functions, and 9 new Context fields (last_cpu, sched_policy, sched_rt_priority, sched_rr_ticks_consumed, sched_static_prio, sched_rr_quantum, vruntime, futex_pi_*, PhysicalAddress import)
`f7652fc`	P5-context-mod-sched + P8-percpu-sched + P8-percpu-wiring: PerCpuSched struct with SyncUnsafeCell-wrapped per-CPU run queues, get_percpu_block helper, full per-CPU scheduler wiring in switch.rs (pick_next_from_queues, pick_next_from_global_queues, select_next_context)
`7fc8bbf`	P8-initial-placement + P9-numa-topology + P9-proc-lock-ordering: least-loaded-CPU spawn, NUMA topology hints, proc scheme lock order fix
`327c150`	`set_sched_policy` + `set_sched_other_prio`: missing Context methods called by proc scheme handles
`e8ec916`	fadt usize/u32 type mismatch fix: changes FADT_MIN_SIZE constants to `u32` to match `Sdt::length()`
`4789d54`	SchedPolicy/Name/Priority proc scheme handles: adds `/proc/<tid>/{name, sched-policy, priority}` paths and read/write handlers

Upstream check (bg_27f3578a, 2026-07-02): verified that gitlab.redox-os.org/redox-os/kernel master (commit aa7e7d2f44ba7cd9d1b007d37db139b345d46b8a) has NONE of these features. The local fork is the sole implementation. No upstream cherry-picks are available.

Plan-vs-actual state:

Plan claim (Section 1)	Actual state after Phase 0c
"Baseline DWRR scheduler only (no per-CPU queues, no work stealing, no load balancing, no vruntime, no RT scheduling, no cache-affine)"	✅ All 6 features now present. Per-CPU `PerCpuSched` with `run_queues`, `steal_work()`, `migrate_one_context()`, `maybe_balance_queues()`, `vruntime` CFS-style weighting, `last_cpu` cache-affine vruntime bonus, `SchedPolicy::Fifo`/`RoundRobin` RT scanning in `pick_next_from_queues`
"Baseline futex only (WAIT/WAIT64/WAKE — no sharding, no PI, no REQUEUE, no robust, no WAKE_OP, no BITSET)"	🟡 Sharding done (64-shard hash). REQUEUE/PI/robust/WAKE_OP/BITSET still missing.
"relibc `sched_*` are all `todo!()`, `pthread_setschedparam` is a no-op, robust mutexes are `todo_skip!`, PI is absent"	🟡 relibc fork only has the `pthread_cond_signal` POSIX fix so far. `sched_*`, robust, PI still pending in relibc.
"❌ Missing from relibc: CPU affinity API, Thread naming"	🟡 Kernel side done — `/proc/<tid>/{sched-policy, name, priority}` handles. relibc pthread_setname_np / pthread_setaffinity_np / sched_setscheduler still pending.
"cargo check has 1 pre-existing error"	✅ Fixed — `cargo check` now exits 0 with 0 errors.

Phase 0c status: kernel side complete (all 8 of 8 applicable kernel P5–P9 patches re-applied or made obsolete by the existing refactored scheduler). Remaining work is in the relibc fork (Phase 0e) and the futex-REQUEUE/PI/robust work (Phase 1).

The detailed analysis in §1–§9 below is preserved as historical record. The status column "🚧 Missing" in §1 should be re-read as "now present in the local kernel fork, pending relibc userspace wiring."

1. Executive Summary

The Critical Finding — Lost Threading Work

The P5–P9 scheduler and futex enhancement work (documented as "complete" in the archived plans) was lost during the local fork migration (2026-06). The local forks at local/sources/kernel/ and local/sources/relibc/ were created from upstream Redox baselines that did NOT include the Red Bear enhancement patches. The patches exist in local/patches/kernel/ and local/patches/relibc/ but are not wired into the recipes (both recipe.toml files use path = "..." with no patches = [...] list).

Impact: The running kernel has:

Baseline DWRR scheduler only (no per-CPU queues, no work stealing, no load balancing, no vruntime, no RT scheduling, no cache-affine)
Baseline futex only (WAIT/WAIT64/WAKE — no sharding, no PI, no REQUEUE, no robust, no WAKE_OP, no BITSET)
relibc sched_* are all todo!(), pthread_setschedparam is a no-op, robust mutexes are todo_skip!, PI is absent

Recovery: 13 of 18 kernel P5–P9 patches apply cleanly to the current fork. 5 fail due to patch-chain dependencies (they expect earlier patches applied first). The bulk of the work is recoverable by re-applying patches to the forks and committing them.

What Actually Works Today

Layer	Status	Detail
SMP boot	✅ Solid	INIT→SIPI sequence correct, per-CPU PCR via GS_BASE, x2APIC support
Context switching	✅ Solid	FPU/SIMD/AVX state save via XSAVE, FSBASE/GSBASE swap (FSGSBASE or MSR), correct callee-saved register save
TLB shootdown protocol	✅ Correct	AtomicBool flag + IPI + ack counter with `fence(SeqCst)` race prevention
Basic thread lifecycle	✅ Functional	pthread_create/join/detach/exit through proc scheme + redox_rt clone
Basic synchronization	✅ Functional	Futex-backed mutex, condvar, rwlock, barrier, spinlock, once
TLS	✅ Functional	ELF PT_TLS + pthread_key_create/getspecific/setspecific
Per-CPU data	✅ Functional	PercpuBlock via GS_BASE, all per-CPU state accessible
Signal delivery	✅ Functional	Shared-memory Sigcontrol pages, per-thread masks, trampoline
Scheduler algorithm	🚧 Basic DWRR	40 priority levels, geometric weights, cooperative preemption (3-tick quantum)
Futex operations	🚧 Basic only	WAIT/WAIT64/WAKE with single global mutex
SMP load balancing	❌ Missing	No work stealing, no migration, contexts stuck on birth CPU
RT scheduling	❌ Missing	No SCHED_FIFO/SCHED_RR, no kernel policy dispatch
Futex REQUEUE	❌ Missing	Condvar broadcast causes thundering herd
Robust mutexes	❌ Missing	Thread death while holding mutex → permanent deadlock
PI futexes	❌ Missing	No priority inheritance → priority inversion risk
CPU affinity API	❌ Missing from relibc	Kernel supports sched_affinity field but no userspace API
Thread naming	❌ Missing from relibc	Kernel supports name field but no userspace API
Per-page TLB flush	❌ Missing	`invalidate_all()` = full CR3 reload on every shootdown
NUMA awareness	❌ Missing	No SRAT/SLIT, no proximity domains, flat memory model
IRQ balancing	❌ Missing	All legacy IRQs hardwired to BSP

2. Layer-by-Layer Assessment

2.1 Hardware / SMP Layer

Files: src/acpi/madt/arch/x86.rs, src/arch/x86_shared/start.rs, src/arch/x86_shared/device/local_apic.rs, src/arch/x86_shared/device/ioapic.rs, src/arch/x86_shared/ipi.rs, src/arch/x86_shared/interrupt/ipi.rs, src/percpu.rs, src/arch/x86_shared/gdt.rs

Verdict: Functional foundation, performance gaps.

Component	Status	Detail
AP boot (INIT/SIPI)	✅ validated	Correct trampoline at 0x8000, per-AP PCR/IDT/stack allocation
x2APIC mode	✅ builds	Detected via CPUID, MSR-based access, APIC ID detection
Per-CPU PCR via GS_BASE	✅ validated	`PercpuBlock::current()` reads from PCR, SWAPGS protocol correct
IPI send/receive	✅ functional	5 IPI kinds (Wakeup/Tlb/Switch/Pit/Profile), broadcast + unicast
TLB shootdown	✅ correct	AtomicBool + IPI + ack with `fence(SeqCst)` race prevention
TLB granularity	❌ coarse	Full CR3 reload (`mov cr3, cr3`) on every shootdown — no INVLPG
TLB broadcast	🚧 sequential	Iterates CPUs individually, doesn't use ICR "all excluding self" shorthand
IRQ routing	❌ BSP-only	Legacy I/O APIC entries hardcode `dest: bsp_apic_id`
NUMA	❌ absent	No SRAT/SLIT, no proximity domains
SMT/HT topology	❌ absent	No cache hierarchy, no hyperthread awareness
Idle loop	✅ functional	MWAIT with deepest C-state or HLT fallback
W^X for trampoline	🚧 minor	Trampoline page briefly W+X, unmapped after AP boot

2.2 Kernel Scheduler Layer

Files: src/context/switch.rs, src/context/mod.rs, src/context/context.rs, src/context/timeout.rs

Verdict: Correct but primitive — DWRR only, no SMP balancing, no RT classes.

Algorithm: Deficit Weighted Round Robin (DWRR)

40 priority levels, each a VecDeque<WeakContextRef>
Geometric weights: SCHED_PRIO_TO_WEIGHT[i] ≈ 88761 / 1.25^i, so each level is about 1.25× lighter than the one before it (88761 at level 0 → 15 at level 39)
Per-CPU balance accumulator drives dequeue decisions
Quantum: 3 PIT ticks (~12.2ms) per scheduling round
Cooperative preemption: preempt_locks > 0 disables preemption
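
To make the geometric weighting concrete, here is a minimal sketch that regenerates an approximate weight table from the two endpoints quoted above. It assumes a constant ratio of 1.25 between adjacent levels and that level 0 is the heaviest; the names `approx_weight` and `approx_table` are illustrative, and the kernel's actual `SCHED_PRIO_TO_WEIGHT` entries may be hand-tuned and differ in the middle of the range.

```rust
/// Number of priority levels described in the text.
const LEVELS: usize = 40;
/// Weight of the heaviest level (figure quoted in the text).
const TOP_WEIGHT: f64 = 88761.0;
/// Assumed ratio between adjacent levels.
const STEP: f64 = 1.25;

/// Approximate weight for priority level `i` (0 = heaviest, an assumption).
fn approx_weight(i: usize) -> u32 {
    assert!(i < LEVELS, "priority level out of range");
    (TOP_WEIGHT / STEP.powi(i as i32)).round() as u32
}

/// Build the whole approximate table, heaviest level first.
fn approx_table() -> [u32; LEVELS] {
    let mut table = [0u32; LEVELS];
    for (i, w) in table.iter_mut().enumerate() {
        *w = approx_weight(i);
    }
    table
}
```

With these assumptions the last level rounds to 15, which matches the range given above.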

Global locks:

RUN_CONTEXTS: Mutex<L1, RunContextData> — all 40 priority queues under one L1 lock
IDLE_CONTEXTS: Mutex<L2, VecDeque<WeakContextRef>> — sleeping contexts
CONTEXT_SWITCH_LOCK: AtomicBool — global CAS spinlock serializing all context switches

What's missing (all of it was in the lost P5–P9 work):

Gap	Lost Patch	Recoverable?
Per-CPU run queues (eliminate global L1)	P6-percpu-runqueues, P8-percpu-sched, P8-percpu-wiring	✅ applies cleanly
Work stealing	P8-work-stealing	❌ needs rebase (depends on per-CPU wiring)
Initial placement (least-loaded CPU)	P8-initial-placement	✅ applies cleanly
Load balancing	P8-load-balance (absorbed)	needs verification
Vruntime tracking + min-vruntime selection	P6-vruntime-switch	✅ applies cleanly
SchedPolicy enum (FIFO/RR/Other)	P5-sched-rt-policy	✅ applies cleanly
RT scheduling dispatch	P5-sched-rt-policy	✅ applies cleanly
Cache-affine scheduling	P7-cache-affine-switch	✅ applies cleanly
NUMA topology hints	P9-numa-topology	✅ applies cleanly

2.3 Kernel Futex Layer

File: src/syscall/futex.rs

Verdict: Baseline only — critical operations missing for desktop workloads.

Operation	Status	Impact of Absence
`FUTEX_WAIT` (32-bit)	✅	—
`FUTEX_WAIT64` (64-bit)	✅	—
`FUTEX_WAKE`	✅	—
`FUTEX_REQUEUE`	❌ returns EINVAL	`pthread_cond_broadcast` wakes ALL waiters (thundering herd)
`FUTEX_CMP_REQUEUE`	❌ not defined	Same + atomicity gap
`FUTEX_WAKE_OP`	❌ not defined	glibc mutex fast path unavailable
`FUTEX_WAIT_BITSET`	❌ not defined	`pselect`/`ppoll` optimization unavailable
`FUTEX_WAKE_BITSET`	❌ not defined	Targeted wake unavailable
`FUTEX_LOCK_PI` / `UNLOCK_PI`	❌ not defined	Priority inversion unprotected
Robust futex list	❌ not defined	Thread death → permanent deadlock
Futex sharding (per-futex lock)	❌ single global L1 mutex	All futex ops on all CPUs contend on one lock
Process-private futexes	❌ global table	Unnecessary cross-process visibility

Architecture:

static FUTEXES: Mutex<L1, FutexList>  // single global lock
type FutexList = HashMap<PhysicalAddress, Vec<FutexEntry>>

Physical address is the key (enables cross-address-space futex via MAP_SHARED). Virtual address + Weak used for CoW disambiguation.
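
For contrast with the single global lock, the following is a rough sketch of how a 64-shard table keyed by physical address could look. It is not the P6-futex-sharding code: the hash in `shard_index`, the `ShardedFutexTable` type, and the use of `std::sync::Mutex` and plain integer waiter ids are all assumptions made so the example runs in userspace.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Shard count taken from the text.
const SHARDS: usize = 64;

type PhysAddr = usize; // stand-in for the kernel's PhysicalAddress
type WaiterId = u32;   // stand-in for a FutexEntry

/// Hypothetical hash: drop the alignment bits, mix, keep the top 6 bits.
fn shard_index(addr: PhysAddr) -> usize {
    let mixed = ((addr >> 2) as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15);
    (mixed >> 58) as usize // 6 bits, so always in 0..64
}

struct ShardedFutexTable {
    shards: Vec<Mutex<HashMap<PhysAddr, Vec<WaiterId>>>>,
}

impl ShardedFutexTable {
    fn new() -> Self {
        Self { shards: (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect() }
    }

    /// Record a waiter; only the shard owning `addr` is locked.
    fn enqueue(&self, addr: PhysAddr, waiter: WaiterId) {
        let mut shard = self.shards[shard_index(addr)].lock().unwrap();
        shard.entry(addr).or_default().push(waiter);
    }

    /// Remove up to `count` waiters in FIFO order and return them.
    fn wake(&self, addr: PhysAddr, count: usize) -> Vec<WaiterId> {
        let mut shard = self.shards[shard_index(addr)].lock().unwrap();
        let Some(queue) = shard.get_mut(&addr) else { return Vec::new() };
        let n = count.min(queue.len());
        let woken: Vec<WaiterId> = queue.drain(..n).collect();
        if queue.is_empty() {
            shard.remove(&addr);
        }
        woken
    }
}
```

Operations on futexes that hash to different shards never touch the same lock, which is the contention win sharding is meant to deliver.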

Recoverable work (lost patches):

Feature	Lost Patch	Applies?
64-shard hash table	P6-futex-sharding	✅ cleanly
FUTEX_REQUEUE + CMP_REQUEUE	P8-futex-requeue	❌ needs rebase
PI futex (LOCK_PI/UNLOCK_PI/TRYLOCK_PI)	P8-futex-pi	❌ needs rebase
PI CAS fix	P9-futex-pi-cas-fix	❌ needs rebase
Robust futex list	P8-futex-robust	❌ needs rebase

The 4 failing patches likely fail because they depend on sharding (P6-futex-sharding) being applied first. Apply in order: P6-sharding → P8-requeue → P8-pi → P8-robust → P9-pi-cas-fix.

2.4 Kernel Syscall ABI Layer

Files: src/syscall/mod.rs, src/syscall/futex.rs, src/syscall/time.rs, src/syscall/process.rs, local/sources/syscall/src/number.rs, src/scheme/proc.rs

Verdict: Minimal surface — most threading done via proc scheme, not syscalls.

The kernel defines only ~35 syscall numbers. Threading-relevant ones:

Syscall	Status	Notes
`SYS_FUTEX` (240)	✅ partial	WAIT/WAIT64/WAKE only
`SYS_YIELD` (158)	✅	`context::switch()` + signal handler
`SYS_FMAP` (900)	✅	Anonymous + file-backed mmap
`SYS_FUNMAP` (92)	✅	munmap
`SYS_MPROTECT` (125)	✅
`SYS_MREMAP` (155)	✅
`SYS_NANOSLEEP` (162)	✅	EINTR-aware
`SYS_CLOCK_GETTIME` (265)	✅ partial	REALTIME + MONOTONIC only

Threading done via proc scheme (not syscalls):

Operation	Mechanism
Thread/process creation	`proc:` scheme: open "new-context", share addr_space + files via kdup
waitpid	`proc:` scheme: `EVENT_READ` on context fd
getpid/gettid	`proc:` scheme: read "attrs" handle
kill/tkill	`proc:` scheme: `ForceKill` / `Interrupt` ContextVerb
CPU affinity	`proc:` scheme: write "sched-affinity" handle
Priority	`proc:` scheme: write "attrs" prio field
Signal setup	`proc:` scheme: write "sighandler" + shared Sigcontrol pages
TLS base (FSBASE)	`proc:` scheme: write "regs/env" EnvRegisters

Completely missing syscalls (no number, no handler): clone, fork, vfork, waitpid, wait4, kill, tkill, tgkill, arch_prctl, set_thread_area, set_tid_address, set_robust_list, get_robust_list, sched_setaffinity, sched_getaffinity, sched_setscheduler, sched_getparam, sigaction, sigprocmask, sigpending, sigsuspend, sigtimedwait, timer_create, timer_settime, timer_delete, timerfd_create, getrusage, setrlimit, getrlimit, times

2.5 relibc Pthread Layer

Files: src/pthread/mod.rs, src/sync/*.rs, src/header/pthread/*.rs, src/header/sched/mod.rs, src/ld_so/tcb.rs, src/platform/redox/mod.rs

Verdict: Core pthreads solid, scheduling/robust/PI absent, several POSIX gaps.

Fully Working (futex-backed)

API Group	Backend	Notes
`pthread_create/join/detach/exit`	redox_rt clone + Waitval	Stack via mmap, TLS via Tcb::new()
`pthread_cancel/setcancelstate/testcancel`	SIGRT_RLCT_CANCEL (33)	Deferred cancellation only
`pthread_mutex_*` (normal/recursive/errorcheck)	AtomicU32 CAS + futex_wait/wake	3-state: unlocked/locked/waiters
`pthread_cond_*`	Two-counter futex design	CLOCK_REALTIME only (monotonic = stub)
`pthread_rwlock_*`	AtomicU32 + futex	Reader count + WAITING_WR bit
`pthread_barrier_*`	Mutex + Cond	gen_id wrapping counter
`pthread_spin_*`	AtomicI32 CAS	No futex, pure spinning
`pthread_once`	3-state futex (UNINIT→INITING→INIT)
`pthread_key_create/getspecific/setspecific/delete`	BTreeMap global + thread_local values	Destructor iteration per POSIX
`pthread_sigmask`	Delegates to sigprocmask
`pthread_kill`	redox_rt::rlct_kill
`pthread_atfork`	Thread-local LinkedList hooks
ELF TLS (`__thread` / `#[thread_local]`)	PT_TLS + Tcb	Static + dynamic DTV for dlopen
`pthread_attr_*` (getters/setters)	RlctAttr struct

Stubs / No-ops / Missing

API	Status	Root Cause
`sched_get_priority_max/min`	`todo!()`	Kernel has no scheduling policy API
`sched_getparam/setparam`	`todo!()`	Same
`sched_setscheduler`	`todo!()`	Same
`sched_rr_get_interval`	`todo!()`	Same
`pthread_setschedparam`	No-op (returns Ok)	Kernel ignores policy
`pthread_setschedprio`	No-op (returns Ok)	Kernel ignores priority change
`pthread_getschedparam`	`todo!()`
`pthread_getcpuclockid`	ENOENT	No per-thread CPU clock
`pthread_mutex_consistent`	`todo_skip!`	Robust mutex not implemented
`pthread_mutex_getprioceiling`	`todo_skip!`	Priority ceiling not implemented
`pthread_mutex_setprioceiling`	`todo_skip!`	Same
`pthread_mutexattr_setprotocol` (PRIO_INHERIT)	Accepted, no-op	PI futex missing
`pthread_mutexattr_setrobust` (ROBUST)	Accepted, no-op	Robust futex missing
`pthread_cond_init` CLOCK_MONOTONIC	`todo_skip!`
`pthread_cond_signal`	Calls broadcast (wakes ALL)	Missing FUTEX_REQUEUE optimization
`pthread_setaffinity_np`	Not defined
`pthread_getaffinity_np`	Not defined
`pthread_setname_np`	Not defined
`pthread_getname_np`	Not defined
`pthread_setcanceltype`	Always returns DEFERRED	ASYNC not tracked
Guard pages	Attribute stored, not mapped	No PROT_NONE page before stack
PTHREAD_KEYS_MAX limit	Not checked

3. Gap Classification

3.1 Correctness Gaps (Must Fix — Silent Data Corruption or Deadlock)

#	Gap	Impact	Fix Location
C1	No robust mutexes	Thread death while holding mutex → permanent deadlock for all waiters	Kernel: robust futex list + relibc: pthread_mutex_consistent
C2	No PI futexes	Priority inversion: low-prio thread blocks high-prio thread indefinitely	Kernel: FUTEX_LOCK_PI/UNLOCK_PI + relibc: mutexattr_setprotocol
C3	`pthread_cond_signal` wakes ALL	Correctness: wastes CPU. Performance: thundering herd on every signal	relibc: use true wake(1) — may need FUTEX_REQUEUE
C4	`fork()` not thread-safe	`pthread_atfork` hooks exist but child inherits locked mutexes	relibc: implement atfork child handlers properly

3.2 Performance Gaps (Must Fix for Desktop Responsiveness)

#	Gap	Impact	Fix Location
P1	No SMP load balancing	Cores sit idle while others are overloaded	Kernel: work stealing + initial placement
P2	No futex sharding	Single global L1 mutex for ALL futex ops on ALL CPUs	Kernel: 64-shard hash table
P3	No FUTEX_REQUEUE	`pthread_cond_broadcast` wakes all → thundering herd	Kernel: REQUEUE + CMP_REQUEUE
P4	Full TLB flush on every shootdown	Per-page mprotect/munmap flushes entire TLB on all cores	Kernel: INVLPG-based selective flush
P5	Global context switch lock	Serialization bottleneck beyond ~8 cores	Kernel: per-CPU context switch (needs per-CPU run queues)
P6	All IRQs to BSP	CPU 0 handles all interrupts, cache thrash, latency	Kernel: IRQ steering in I/O APIC + MSI/MSI-X dest field
P7	No RT scheduling	Audio/compositor threads can't get priority	Kernel: SchedPolicy + RT dispatch + relibc: sched_setscheduler

3.3 POSIX Completeness Gaps (Must Fix for Application Compatibility)

#	Gap	Impact	Fix Location
X1	`sched_*` all `todo!()`	Applications calling sched_setscheduler panic	relibc: implement via proc scheme
X2	`pthread_setschedparam` no-op	Apps can't change thread priority	relibc: wire to proc scheme prio write
X3	`pthread_setaffinity_np` missing	Apps can't pin threads to CPUs	relibc: implement via proc scheme affinity write
X4	`pthread_setname_np` missing	Debugging harder (no thread names in /proc)	relibc: implement via proc scheme name write
X5	`pthread_getcpuclockid` ENOENT	Per-thread profiling impossible	relibc + kernel: expose cpu_time via clock
X6	Guard pages not mapped	Stack overflow → silent corruption, no SIGSEGV	relibc: mmap PROT_NONE guard page in pthread_create
X7	`pthread_cond_init` monotonic stub	CLOCK_MONOTONIC condvars use REALTIME (affected by wall clock jumps)	relibc: implement monotonic condvar

4. Implementation Plan

Phase 0: Patch Recovery — Re-Apply Lost Threading Work (Week 1–2)

Goal: Recover the P5–P9 work that was lost during the local fork migration.

This is the highest-priority phase — it restores ~6 months of work with minimal new code.

0.1 — Re-apply kernel scheduler patches to local fork

Apply in dependency order to local/sources/kernel/:

Order	Patch	Status	Action
1	P6-futex-sharding	✅ applies	Commit directly
2	P6-percpu-runqueues	✅ applies	Commit directly
3	P8-percpu-sched	✅ applies	Commit directly
4	P8-percpu-wiring	✅ applies	Commit directly
5	P8-initial-placement	✅ applies	Commit directly
6	P5-sched-rt-policy	✅ applies	Commit directly
7	P5-context-mod-sched	✅ applies	Commit directly
8	P6-vruntime-switch	✅ applies	Commit directly
9	P7-cache-affine-switch	✅ applies	Commit directly
10	P9-numa-topology	✅ applies	Commit directly
11	P9-proc-lock-ordering	✅ applies	Commit directly
12	P8-work-stealing	❌ needs rebase	Rebase against 1–11, then apply
13	P8-futex-requeue	❌ needs rebase	Rebase against P6-sharding (#1), then apply
14	P8-futex-pi	❌ needs rebase	Rebase against #13, then apply
15	P8-futex-robust	❌ needs rebase	Rebase against #14, then apply
16	P9-futex-pi-cas-fix	❌ needs rebase	Rebase against #14, then apply
17	P7-scheduler-improvements	❌ needs rebase	Rebase against 1–11, then apply

Verification after each patch:

cd local/sources/kernel
cargo check  # must pass

0.2 — Re-apply relibc threading patches to local fork

Apply to local/sources/relibc/:

Patch	Action
P3-threads.patch	✅ applies — commit
P3-barrier-smp-futex (from absorbed/)	Verify already in fork; if not, apply
P3-pthread-signal-races (from absorbed/)	Verify already in fork
P3-pthread-yield (from absorbed/)	Verify already in fork
P5-robust-mutexes (from absorbed/)	Verify; re-apply if missing
P5-robust-mutex-enotrec-fix (from absorbed/)	Same
P5-sched-api (from absorbed/)	Same
P7-pthread-affinity (from absorbed/)	Same
P7-pthread-setname (from absorbed/)	Same
P7-setpriority (from absorbed/)	Same
P9-spin-and-barrier (from absorbed/)	Same
P9-spin-fix (from absorbed/)	Same
P3-semaphore-comprehensive	✅ applies

Verification:

cd local/sources/relibc
make all  # must pass
touch relibc && make prefix  # rebuild prefix with new libc

0.3 — Build and smoke test

export REDBEAR_ALLOW_PROTECTED_FETCH=1
./local/scripts/build-redbear.sh --upstream redbear-mini
make qemu  # verify boot + basic operation

Success criteria: redbear-mini boots, multi-threaded daemons (pcid, xhcid) start, no kernel panic.

Phase 1: Futex Completeness (Week 2–4)

Goal: Close the futex operation gaps that affect correctness and performance.

Depends on: Phase 0 complete (sharding applied first).

1.1 — FUTEX_REQUEUE + FUTEX_CMP_REQUEUE

Kernel: src/syscall/futex.rs

Add FUTEX_REQUEUE and FUTEX_CMP_REQUEUE to the futex dispatcher
Implement: move up to val waiters from addr1 → addr2, optionally compare *addr1 == val2
Requires locking TWO shards (acquire both in deterministic order to avoid deadlock)
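
The two-shard locking rule can be sketched as follows. This is an illustration under stated assumptions, not the P8-futex-requeue patch: `shard_of`, `take`, `put`, and `requeue` are hypothetical names, the shard function is a placeholder, and the compare step of CMP_REQUEUE is omitted.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

type Shard = HashMap<usize, Vec<u32>>; // futex address -> waiter ids (FIFO)

/// Placeholder shard function (an assumption, not the kernel's hash).
fn shard_of(addr: usize, shard_count: usize) -> usize {
    (addr >> 2) % shard_count
}

/// Remove up to `n` waiters queued on `addr`.
fn take(shard: &mut Shard, addr: usize, n: usize) -> Vec<u32> {
    let Some(q) = shard.get_mut(&addr) else { return Vec::new() };
    let n = n.min(q.len());
    let moved: Vec<u32> = q.drain(..n).collect();
    if q.is_empty() {
        shard.remove(&addr);
    }
    moved
}

/// Append waiters to the queue for `addr`.
fn put(shard: &mut Shard, addr: usize, waiters: Vec<u32>) {
    if !waiters.is_empty() {
        shard.entry(addr).or_default().extend(waiters);
    }
}

/// Move up to `max` waiters from `from` to `to`; returns how many moved.
fn requeue(shards: &[Mutex<Shard>], from: usize, to: usize, max: usize) -> usize {
    let (a, b) = (shard_of(from, shards.len()), shard_of(to, shards.len()));
    if a == b {
        // Same shard: one lock covers both queues.
        let mut g = shards[a].lock().unwrap();
        let moved = take(&mut g, from, max);
        let n = moved.len();
        put(&mut g, to, moved);
        return n;
    }
    // Different shards: always lock the lower index first, so two
    // concurrent requeues in opposite directions cannot deadlock.
    let (lo, hi) = (a.min(b), a.max(b));
    let mut g_lo = shards[lo].lock().unwrap();
    let mut g_hi = shards[hi].lock().unwrap();
    let (src, dst) = if a == lo {
        (&mut *g_lo, &mut *g_hi)
    } else {
        (&mut *g_hi, &mut *g_lo)
    };
    let moved = take(src, from, max);
    let n = moved.len();
    put(dst, to, moved);
    n
}
```

Ordering by shard index rather than by address keeps the rule valid even when two different addresses collide in one shard.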

relibc: src/sync/cond.rs

Change pthread_cond_broadcast to use FUTEX_REQUEUE (move waiters from condvar futex to mutex futex)
Change pthread_cond_signal to wake exactly 1 (not all)

Impact: Eliminates thundering herd on every pthread_cond_broadcast. Major win for Qt event loop, KWin compositor, Mesa worker threads.

1.2 — PI Futexes (FUTEX_LOCK_PI / FUTEX_UNLOCK_PI / FUTEX_TRYLOCK_PI)

Kernel: src/syscall/futex.rs

Add PiState tracking per futex: owner context + waiter list with priorities
On LOCK_PI block: boost owner's priority to waiter's priority
On UNLOCK_PI: restore original priority, wake highest-priority waiter
Requires kernel RT scheduling (Phase 0.1 #6–7: P5-sched-rt-policy)
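
The boost and restore rules above can be modelled in a few lines. This sketch assumes that a larger number means a higher priority and that ties are broken in arrival order; `PiState`'s fields and the `block`/`unlock` methods are illustrative and are not taken from the P8-futex-pi patch.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Waiter {
    tid: u32,
    prio: u8, // larger = more urgent (assumption)
}

/// Hypothetical per-futex PI bookkeeping.
struct PiState {
    owner: u32,
    owner_base_prio: u8,
    waiters: Vec<Waiter>,
}

impl PiState {
    fn new(owner: u32, owner_base_prio: u8) -> Self {
        Self { owner, owner_base_prio, waiters: Vec::new() }
    }

    /// Priority the owner should currently run at:
    /// its own base priority or the highest waiter priority, whichever is larger.
    fn effective_owner_prio(&self) -> u8 {
        self.waiters.iter().map(|w| w.prio).fold(self.owner_base_prio, u8::max)
    }

    /// LOCK_PI slow path: queue the waiter and return the boosted owner priority.
    fn block(&mut self, waiter: Waiter) -> u8 {
        self.waiters.push(waiter);
        self.effective_owner_prio()
    }

    /// UNLOCK_PI: returns the priority the old owner drops back to and
    /// the waiter that becomes the new owner (None if nobody is waiting,
    /// in which case the caller is expected to discard this state).
    fn unlock(&mut self) -> (u8, Option<Waiter>) {
        let restored = self.owner_base_prio;
        let mut best: Option<usize> = None;
        for (i, w) in self.waiters.iter().enumerate() {
            // Strict comparison keeps the earliest waiter among equals.
            if best.map_or(true, |b| w.prio > self.waiters[b].prio) {
                best = Some(i);
            }
        }
        let next = best.map(|i| self.waiters.remove(i));
        if let Some(w) = next {
            self.owner = w.tid;
            self.owner_base_prio = w.prio;
        }
        (restored, next)
    }
}
```

A real implementation would also need to propagate boosts along chains of PI locks, which this model leaves out.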

relibc: src/sync/pthread_mutex.rs

Implement PTHREAD_PRIO_INHERIT protocol path using PI futex
Replace todo_skip! in pthread_mutex_consistent with real implementation

1.3 — Robust Futex List

Kernel: src/syscall/futex.rs + src/context/context.rs

Add robust_list_head: Option<usize> to Context struct
Implement set_robust_list / get_robust_list via proc scheme or syscall
On thread exit (exit_this_context): walk robust list, set FUTEX_OWNER_DIED bit, wake one waiter with EOWNERDEAD
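
The exit-time walk can be sketched as below. The bit values are the ones Linux uses for robust futexes and are assumed here, since the plan does not fix an encoding; the robust list is modelled as a slice of futex words rather than a linked list in user memory, and `handle_owner_exit` is a hypothetical name.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Linux's encoding, assumed for illustration.
const FUTEX_OWNER_DIED: u32 = 0x4000_0000;
const FUTEX_TID_MASK: u32 = 0x3fff_ffff;

/// For every futex word still owned by `dead_tid`, clear the owner TID and
/// set the OWNER_DIED bit. Returns the indices of the words that were
/// marked; the caller would wake one waiter on each of them, and that
/// waiter's lock call would report EOWNERDEAD.
fn handle_owner_exit(robust_list: &[&AtomicU32], dead_tid: u32) -> Vec<usize> {
    let mut to_wake = Vec::new();
    for (i, word) in robust_list.iter().enumerate() {
        let mut cur = word.load(Ordering::Relaxed);
        loop {
            if (cur & FUTEX_TID_MASK) != dead_tid {
                break; // not held by the exiting thread, leave it alone
            }
            // Keep the flag bits, drop the TID, mark the owner as dead.
            let new = (cur & !FUTEX_TID_MASK) | FUTEX_OWNER_DIED;
            match word.compare_exchange_weak(cur, new, Ordering::AcqRel, Ordering::Relaxed) {
                Ok(_) => {
                    to_wake.push(i);
                    break;
                }
                Err(actual) => cur = actual, // raced with userspace, retry
            }
        }
    }
    to_wake
}
```

The compare-and-swap loop matters because userspace can modify the word (for example, setting a waiters flag) while the kernel is walking the list.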

relibc: src/sync/pthread_mutex.rs

Implement robust list registration in pthread_mutex_lock
Implement pthread_mutex_consistent: clear EOWNERDEAD state
Replace todo_skip! with real implementation

1.4 — FUTEX_WAKE_OP

Kernel: src/syscall/futex.rs

Implement atomic op + wake: perform op on addr2, then wake up to val waiters on addr1
Operations: set, add, or, andn, xor, with comparison condition
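
As a sketch of the "atomic op plus comparison" half of this operation, the following models the five operations and the comparison on the old value. It follows the Linux convention that the comparison is signed and decides whether the second futex's waiters are woken as well; that convention, the enum-based interface (instead of an encoded argument word), and the name `wake_op` are assumptions.

```rust
use std::sync::atomic::{AtomicI32, Ordering};

#[derive(Clone, Copy)]
enum Op { Set, Add, Or, AndN, Xor }

#[derive(Clone, Copy)]
enum Cmp { Eq, Ne, Lt, Le, Gt, Ge }

/// Atomically apply `op` with `arg` to `word` and return the old value.
fn apply_op(word: &AtomicI32, op: Op, arg: i32) -> i32 {
    match op {
        Op::Set => word.swap(arg, Ordering::AcqRel),
        Op::Add => word.fetch_add(arg, Ordering::AcqRel),
        Op::Or => word.fetch_or(arg, Ordering::AcqRel),
        Op::AndN => word.fetch_and(!arg, Ordering::AcqRel),
        Op::Xor => word.fetch_xor(arg, Ordering::AcqRel),
    }
}

/// Evaluate the comparison against the value the word held before the op.
fn condition_holds(old: i32, cmp: Cmp, cmparg: i32) -> bool {
    match cmp {
        Cmp::Eq => old == cmparg,
        Cmp::Ne => old != cmparg,
        Cmp::Lt => old < cmparg,
        Cmp::Le => old <= cmparg,
        Cmp::Gt => old > cmparg,
        Cmp::Ge => old >= cmparg,
    }
}

/// Perform the op on the second futex word and report whether its
/// waiters should also be woken. Waking waiters on the first futex is
/// unconditional and left to the caller.
fn wake_op(word2: &AtomicI32, op: Op, arg: i32, cmp: Cmp, cmparg: i32) -> bool {
    let old = apply_op(word2, op, arg);
    condition_holds(old, cmp, cmparg)
}
```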

Impact: glibc mutex fast path optimization. Not critical for relibc but helps ported glibc-linked binaries.

Phase 2: SMP Scheduling Quality (Week 3–6)

Goal: Make the scheduler actually distribute runnable work across cores.

Depends on: Phase 0 complete (per-CPU queues applied).

2.1 — Work stealing (recover + fix)

Kernel: src/context/switch.rs

On select_next_context() empty local queue: steal from victim CPU
Pick victim by round-robin, steal highest-priority runnable context
Limit steal batch size (1–2 contexts per steal attempt)
Send IpiKind::Wakeup to target CPU if stealing woke it from idle
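
The stealing policy in the list above (round-robin victim, highest priority first, small batch) can be sketched like this. It is a single-threaded model for illustration: `CpuQueues` and this `steal_work` are hypothetical stand-ins for the per-CPU structures, index 0 is assumed to be the highest priority, and locking, affinity checks, and the wakeup IPI are omitted.

```rust
use std::collections::VecDeque;

const PRIO_LEVELS: usize = 40;
/// Upper bound on contexts taken per steal attempt (from the text: 1 to 2).
const STEAL_BATCH: usize = 2;

/// Hypothetical per-CPU run queues; index 0 = highest priority (assumption).
struct CpuQueues {
    levels: Vec<VecDeque<u32>>,
}

impl CpuQueues {
    fn new() -> Self {
        Self { levels: vec![VecDeque::new(); PRIO_LEVELS] }
    }
    fn push(&mut self, prio: usize, ctx: u32) {
        self.levels[prio].push_back(ctx);
    }
    /// Pop the front context of the highest non-empty priority level.
    fn pop_highest(&mut self) -> Option<(usize, u32)> {
        self.levels
            .iter_mut()
            .enumerate()
            .find_map(|(prio, q)| q.pop_front().map(|ctx| (prio, ctx)))
    }
    fn is_empty(&self) -> bool {
        self.levels.iter().all(|q| q.is_empty())
    }
}

/// Try to steal work for `thief`. `cursor` remembers where the
/// round-robin victim search resumes next time. Returns the number of
/// contexts moved onto the thief's queues.
fn steal_work(cpus: &mut [CpuQueues], thief: usize, cursor: &mut usize) -> usize {
    let n = cpus.len();
    for step in 0..n {
        let victim = (*cursor + step) % n;
        if victim == thief || cpus[victim].is_empty() {
            continue;
        }
        let mut stolen = 0;
        while stolen < STEAL_BATCH {
            match cpus[victim].pop_highest() {
                Some((prio, ctx)) => {
                    cpus[thief].push(prio, ctx);
                    stolen += 1;
                }
                None => break,
            }
        }
        *cursor = (victim + 1) % n;
        return stolen;
    }
    0
}
```

Capping the batch keeps a single idle CPU from draining a busy one and simply moving the imbalance.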

Recovery: P8-work-stealing needs rebase against per-CPU wiring.

2.2 — Load balancing (recover + verify)

Kernel: src/context/switch.rs

Periodic balance trigger (every N ticks or when queue depth difference > threshold)
Migrate contexts from overloaded CPU to most-idle CPU
Respect sched_affinity mask during migration
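
A single balancing step that honours the affinity mask might look like the sketch below. The queue-length threshold, the 64-bit mask layout (bit i set means the context may run on CPU i), and the names `Ctx` and `balance_once` are assumptions for illustration; the recovered `maybe_balance_queues()` may use a different trigger and data layout.

```rust
/// Hypothetical runnable context with an affinity bitmask
/// (bit i set = allowed on CPU i; at most 64 CPUs in this sketch).
struct Ctx {
    id: u32,
    affinity: u64,
}

/// Move one context from the longest queue to the shortest one if the
/// length difference exceeds `threshold` and some context on the busy
/// CPU is allowed to run on the idle one.
/// Returns (context id, source CPU, destination CPU) on success.
fn balance_once(queues: &mut [Vec<Ctx>], threshold: usize) -> Option<(u32, usize, usize)> {
    let busiest = (0..queues.len()).max_by_key(|&c| queues[c].len())?;
    let idlest = (0..queues.len()).min_by_key(|&c| queues[c].len())?;
    if queues[busiest].len() - queues[idlest].len() <= threshold {
        return None; // imbalance too small to be worth a migration
    }
    // First context whose affinity mask permits the destination CPU.
    let pos = queues[busiest]
        .iter()
        .position(|c| (c.affinity & (1u64 << idlest)) != 0)?;
    let ctx = queues[busiest].remove(pos);
    let id = ctx.id;
    queues[idlest].push(ctx);
    Some((id, busiest, idlest))
}
```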

Recovery: P8-load-balance is in absorbed/ — verify it's in the fork after Phase 0.

2.3 — Reschedule IPI

Kernel: src/arch/x86_shared/ipi.rs + src/context/switch.rs

When waking a context on a different CPU, send IpiKind::Switch to that CPU
Currently the Switch IPI exists but is not used by the scheduler

2.4 — Per-page TLB flush (INVLPG)

Kernel: rmm/src/arch/x86_64.rs + src/context/memory.rs

Add invalidate_page(addr) using invlpg instruction
Modify Flusher to track individual pages and use INVLPG when ≤ N pages affected
Fall back to CR3 reload only for large-scale invalidations
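
The decision between per-page invalidation and a full reload can be sketched as a small accumulator. The threshold value, the `FlushPlan` enum, and this `Flusher` are hypothetical; the sketch only computes what to flush and leaves the actual `invlpg` or CR3 write, and the cross-CPU shootdown, to the caller.

```rust
const PAGE_SIZE: usize = 4096;
/// Hypothetical value for the "≤ N pages" threshold.
const MAX_INVLPG_PAGES: usize = 32;

#[derive(Debug, PartialEq)]
enum FlushPlan {
    /// Nothing was modified.
    Nothing,
    /// Invalidate exactly these page-aligned addresses, one INVLPG each.
    Pages(Vec<usize>),
    /// Too many pages: reload CR3 instead.
    Full,
}

/// Accumulates the pages touched by one mapping operation.
struct Flusher {
    pages: Vec<usize>,
    overflowed: bool,
}

impl Flusher {
    fn new() -> Self {
        Self { pages: Vec::new(), overflowed: false }
    }

    /// Record that the page containing `addr` needs invalidating.
    fn queue(&mut self, addr: usize) {
        if self.overflowed {
            return; // already committed to a full flush
        }
        let page = addr & !(PAGE_SIZE - 1);
        if self.pages.contains(&page) {
            return;
        }
        if self.pages.len() == MAX_INVLPG_PAGES {
            self.overflowed = true;
            self.pages.clear();
        } else {
            self.pages.push(page);
        }
    }

    /// Decide how the accumulated changes should be flushed.
    fn finish(self) -> FlushPlan {
        if self.overflowed {
            FlushPlan::Full
        } else if self.pages.is_empty() {
            FlushPlan::Nothing
        } else {
            FlushPlan::Pages(self.pages)
        }
    }
}
```

The right threshold depends on the relative cost of INVLPG and a full reload on the target hardware, so it would need to be measured rather than guessed.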

Impact: Every mprotect/mmap/munmap on a multi-threaded process currently flushes the ENTIRE TLB on every core. This is one of the most impactful single fixes.

2.5 — TLB broadcast optimization

Kernel: src/percpu.rs

Replace per-CPU sequential shootdown_tlb_ipi(Some(id)) loop with ICR "all excluding self" (destination shorthand 0b11)
Single IPI + global ack counter instead of N individual IPIs + N ack counters

Phase 3: RT Scheduling (Week 4–6)

Goal: Allow applications to request real-time scheduling for latency-sensitive threads.

Depends on: Phase 0 (SchedPolicy applied) + Phase 2 (per-CPU queues).

3.1 — Kernel RT scheduling dispatch

Kernel: src/context/switch.rs (from P5-sched-rt-policy — recovered in Phase 0)

select_next_context() passes:
1. SCHED_FIFO contexts (highest RT priority first, no preemption within same prio)
2. SCHED_RR contexts (highest RT priority first, round-robin within same prio)
3. SCHED_OTHER contexts (existing DWRR/vruntime)
SCHED_RR quantum: configurable per-context (default 100ms)
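
The three passes can be illustrated with a flat list of runnable contexts. The variant names `Fifo` and `RoundRobin` appear in the update notes above; the `Other` variant name, the `Ctx` fields, the rule that a larger `rt_prio` wins, and selecting the smallest vruntime for SCHED_OTHER are assumptions. Quantum accounting for SCHED_RR is left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum SchedPolicy {
    Fifo,
    RoundRobin,
    Other,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct Ctx {
    id: u32,
    policy: SchedPolicy,
    rt_prio: u8,   // only meaningful for Fifo / RoundRobin
    vruntime: u64, // only meaningful for Other
}

/// Pick the next context to run, in the pass order described above.
/// `runnable` is assumed to be in queue (arrival) order.
fn select_next(runnable: &[Ctx]) -> Option<Ctx> {
    // Passes 1 and 2: real-time classes, highest RT priority first.
    for class in [SchedPolicy::Fifo, SchedPolicy::RoundRobin] {
        let mut best: Option<Ctx> = None;
        for c in runnable.iter().filter(|c| c.policy == class) {
            // Strict comparison keeps the earliest context among equals.
            if best.map_or(true, |b| c.rt_prio > b.rt_prio) {
                best = Some(*c);
            }
        }
        if best.is_some() {
            return best;
        }
    }
    // Pass 3: normal contexts, smallest vruntime first.
    runnable
        .iter()
        .filter(|c| c.policy == SchedPolicy::Other)
        .min_by_key(|c| c.vruntime)
        .copied()
}
```

Note that with this pass order any runnable SCHED_FIFO context is chosen ahead of every SCHED_RR context, regardless of their relative RT priorities.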

3.2 — relibc sched_* API completion

relibc: src/header/sched/mod.rs

Replace ALL todo!() stubs:

Function	Implementation
`sched_getscheduler(pid)`	Read policy from proc scheme attrs
`sched_setscheduler(pid, policy, param)`	Write policy + RT priority via proc scheme
`sched_getparam(pid, param)`	Read RT priority from proc scheme
`sched_setparam(pid, param)`	Write RT priority via proc scheme
`sched_get_priority_max(policy)`	Return 99 for FIFO/RR, 0 for OTHER
`sched_get_priority_min(policy)`	Return 1 for FIFO/RR, 0 for OTHER
`sched_rr_get_interval(pid, tp)`	Return SCHED_RR quantum (100ms default)

3.3 — pthread_setschedparam wiring

relibc: src/pthread/mod.rs

Replace set_sched_param no-op with real proc scheme call
Replace set_sched_priority no-op with real proc scheme call

Phase 4: POSIX Pthread Completeness (Week 5–8)

Goal: Close remaining POSIX gaps that block application compatibility.

Depends on: Phase 0 + Phase 3 (for sched API).

4.1 — pthread_setaffinity_np / pthread_getaffinity_np

relibc: src/header/pthread/mod.rs + src/header/sched/mod.rs

Implement using proc scheme "sched-affinity" write/read
Define cpu_set_t type and CPU_SET/CPU_CLR/CPU_ZERO/CPU_ISSET macros
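
On the Rust side of relibc, the CPU_* macros reduce to simple bit operations on a fixed-size mask. The sketch below assumes a 1024-CPU set stored as 64-bit words, mirroring glibc's layout; the `CpuSet` type and its method names are illustrative, and the proc scheme read/write is not shown.

```rust
/// Capacity of the set; 1024 matches glibc and is an assumption here.
const CPU_SETSIZE: usize = 1024;
const WORDS: usize = CPU_SETSIZE / 64;

/// Hypothetical Rust-side counterpart of `cpu_set_t`.
#[derive(Clone, Debug, PartialEq)]
struct CpuSet {
    bits: [u64; WORDS],
}

impl CpuSet {
    /// CPU_ZERO: an empty set.
    fn zero() -> Self {
        Self { bits: [0; WORDS] }
    }
    /// CPU_SET: out-of-range CPUs are ignored.
    fn set(&mut self, cpu: usize) {
        if cpu < CPU_SETSIZE {
            self.bits[cpu / 64] |= 1u64 << (cpu % 64);
        }
    }
    /// CPU_CLR: out-of-range CPUs are ignored.
    fn clr(&mut self, cpu: usize) {
        if cpu < CPU_SETSIZE {
            self.bits[cpu / 64] &= !(1u64 << (cpu % 64));
        }
    }
    /// CPU_ISSET: out-of-range CPUs are reported as not set.
    fn is_set(&self, cpu: usize) -> bool {
        cpu < CPU_SETSIZE && (self.bits[cpu / 64] & (1u64 << (cpu % 64))) != 0
    }
    /// CPU_COUNT: number of CPUs in the set.
    fn count(&self) -> u32 {
        self.bits.iter().map(|w| w.count_ones()).sum()
    }
}
```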

4.2 — pthread_setname_np / pthread_getname_np

relibc: src/header/pthread/mod.rs

Implement using proc scheme name write/read (kernel already supports 32-char name field)

4.3 — pthread_cond_init CLOCK_MONOTONIC

relibc: src/sync/cond.rs

Replace todo_skip! with real monotonic clock support
Store clock choice in cond struct, use CLOCK_MONOTONIC for deadline calculations

4.4 — Guard pages

relibc: src/pthread/mod.rs

In pthread_create, when allocating stack via mmap:
- Map [stack_base, stack_base + guard_size) with PROT_NONE
- Map [stack_base + guard_size, stack_base + guard_size + stack_size) with PROT_READ | PROT_WRITE
On thread exit, munmap both regions
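
The address arithmetic for the two regions can be checked in isolation before any mmap call is involved. This sketch assumes 4 KiB pages, a downward-growing stack with the guard at the low end, and that both sizes are rounded up to whole pages; `StackLayout` and `stack_layout` are hypothetical names.

```rust
use std::ops::Range;

const PAGE_SIZE: usize = 4096;

/// The two regions of a thread stack mapping.
#[derive(Debug, PartialEq)]
struct StackLayout {
    /// To be mapped PROT_NONE.
    guard: Range<usize>,
    /// To be mapped PROT_READ | PROT_WRITE.
    stack: Range<usize>,
}

/// Round `n` up to a multiple of the page size, or None on overflow.
fn round_up_to_page(n: usize) -> Option<usize> {
    n.checked_add(PAGE_SIZE - 1).map(|v| v & !(PAGE_SIZE - 1))
}

/// Compute the guard and stack ranges for a mapping starting at
/// `stack_base`. Returns None for an unaligned base, an empty stack,
/// or address overflow.
fn stack_layout(stack_base: usize, guard_size: usize, stack_size: usize) -> Option<StackLayout> {
    if stack_base % PAGE_SIZE != 0 || stack_size == 0 {
        return None;
    }
    let guard_end = stack_base.checked_add(round_up_to_page(guard_size)?)?;
    let stack_end = guard_end.checked_add(round_up_to_page(stack_size)?)?;
    Some(StackLayout {
        guard: stack_base..guard_end,
        stack: guard_end..stack_end,
    })
}
```

Because the stack grows toward lower addresses, an overflow runs off the bottom of `stack` into `guard` and faults instead of corrupting a neighbouring mapping.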

4.5 — pthread_getcpuclockid

relibc: src/header/pthread/mod.rs

Return CLOCK_THREAD_CPUTIME_ID (requires kernel support — add clock to clock_gettime)

Kernel: src/syscall/time.rs

Add CLOCK_THREAD_CPUTIME_ID → read context.cpu_time

4.6 — PTHREAD_KEYS_MAX enforcement

relibc: src/header/pthread/tls.rs

Check NEXTKEY against PTHREAD_KEYS_MAX (1024) before allocating
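
A bounded key allocator can be written so that the check and the increment happen in one atomic step. The sketch assumes the next-key counter is an `AtomicU32` and that keys are never recycled; `alloc_key` is a hypothetical helper, and `EAGAIN` is the error POSIX specifies for `pthread_key_create` when the limit is exceeded (the numeric value 11 is an assumption about the target's errno table).

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Limit quoted in the text.
const PTHREAD_KEYS_MAX: u32 = 1024;
/// Assumed errno value for EAGAIN.
const EAGAIN: i32 = 11;

/// Allocate the next key, or fail with EAGAIN once PTHREAD_KEYS_MAX
/// keys have been handed out. `fetch_update` retries on contention, so
/// two racing threads can never both receive the last key.
fn alloc_key(next_key: &AtomicU32) -> Result<u32, i32> {
    next_key
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |k| {
            if k < PTHREAD_KEYS_MAX {
                Some(k + 1)
            } else {
                None
            }
        })
        .map_err(|_| EAGAIN)
}
```

A plain load followed by a separate increment would leave a window in which two threads both pass the check, which is why the sketch uses a single read-modify-write.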

Phase 5: IRQ Steering and NUMA (Week 8–12)

Goal: Distribute interrupt load and respect memory locality.

Depends on: Phase 2 (per-CPU infrastructure).

5.1 — IRQ steering

Kernel: src/arch/x86_shared/device/ioapic.rs + src/arch/x86_shared/idt.rs

Change I/O APIC redirection dest from bsp_apic_id to round-robin or RSS hash
Add per-CPU legacy IRQ handlers in IDT (not just BSP)
For MSI/MSI-X: set destination CPU in Message Address register

5.2 — NUMA topology discovery

Kernel: src/acpi/ (from P9-numa-topology — recovered in Phase 0)

Parse SRAT (Static Resource Affinity Table) for proximity domains
Parse SLIT (System Locality Distance Information Table) for inter-node distances
Store NumaTopology in kernel for O(1) scheduling lookups

5.3 — NUMA-aware memory allocation

Kernel: src/memory/ + frame allocator

Track frame NUMA node in Frame or PageInfo
On allocation, prefer frames from requesting CPU's NUMA node
Fallback to remote node when local node is exhausted

5. Dependency Chain

Phase 0 (Patch Recovery) ← BLOCKING FOR ALL OTHERS
    │
    ├──► Phase 1 (Futex Completeness)
    │       │
    │       ├──► 1.1 REQUEUE ──► condvar performance
    │       ├──► 1.2 PI ──► priority inversion fix (needs Phase 3.1)
    │       ├──► 1.3 Robust ──► deadlock prevention
    │       └──► 1.4 WAKE_OP ──► glibc compat
    │
    ├──► Phase 2 (SMP Scheduling)
    │       │
    │       ├──► 2.1 Work stealing ──► core utilization
    │       ├──► 2.2 Load balancing ──► fair distribution
    │       ├──► 2.3 Reschedule IPI ──► cross-CPU wakeup
    │       ├──► 2.4 Per-page TLB ──► mmap/mprotect performance
    │       └──► 2.5 TLB broadcast ──► IPI efficiency
    │
    ├──► Phase 3 (RT Scheduling)
    │       │
    │       ├──► 3.1 Kernel RT dispatch (from Phase 0)
    │       ├──► 3.2 relibc sched_* API ──► POSIX compat
    │       └──► 3.3 pthread_setschedparam ──► app priority control
    │
    ├──► Phase 4 (POSIX Pthread Completeness)
    │       │
    │       ├──► 4.1 Affinity API ──► CPU pinning
    │       ├──► 4.2 Thread naming ──► debuggability
    │       ├──► 4.3 Monotonic condvar ──► clock correctness
    │       ├──► 4.4 Guard pages ──► stack overflow detection
    │       ├──► 4.5 CPU clock ──► per-thread profiling
    │       └──► 4.6 Keys max ──► resource limit
    │
    └──► Phase 5 (IRQ + NUMA)
            │
            ├──► 5.1 IRQ steering ──► interrupt distribution
            ├──► 5.2 NUMA topology ──► (from Phase 0)
            └──► 5.3 NUMA allocator ──► memory locality

Parallel work possible:

Phase 1 + Phase 2 + Phase 3 can run in parallel after Phase 0
Phase 4 items are independent of each other
Phase 5 depends on Phase 2 but not on Phase 1/3/4

6. Validation Plan

6.1 Build Evidence

Check	Command
Kernel compiles	`make r.kernel`
relibc compiles	`make r.relibc`
Prefix rebuilt	`touch relibc kernel && make prefix`
Full OS builds	`make all CONFIG_NAME=redbear-mini`

6.2 Runtime Evidence (QEMU)

Test	Verification
Multi-threaded boot	`make qemu QEMUFLAGS="-smp 4"` — all 4 CPUs active
pthread smoke test	Guest: compile + run simple pthread_create/join/mutex test
Work stealing	Guest: spawn 8 threads on 4-CPU QEMU, verify all CPUs utilized
Futex REQUEUE	Guest: condvar broadcast benchmark — waiters wake in ≤2 batches, not N
PI futex	Guest: priority inversion test — high-prio thread unblocked within 1 tick
Robust mutex	Guest: kill thread holding mutex, verify EOWNERDEAD recovery
RT scheduling	Guest: SCHED_FIFO thread preempts SCHED_OTHER within 100μs
CPU affinity	Guest: pin thread to CPU 1, verify it never runs on CPU 0
Thread naming	Guest: `cat /scheme/proc/*/name` shows the names that were set
Guard pages	Guest: overflow stack, verify SIGSEGV (not silent corruption)
TLB efficiency	Guest: mprotect benchmark — compare TLB miss rate before/after
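
The pthread smoke test row above could be backed by a guest program along the following lines. This is a minimal sketch, not the project's actual test: the names `pthread_smoke_test`, `smoke_worker`, and `struct smoke_state` are hypothetical, and it assumes only the standard POSIX `pthread_create`, `pthread_join`, and `pthread_mutex_*` calls. Each thread increments a mutex-protected counter, and the final count shows whether any increment was lost.

```c
#include <pthread.h>
#include <stddef.h>

/* Shared state for the smoke test: one counter guarded by one mutex. */
struct smoke_state {
    pthread_mutex_t lock;
    long counter;
    int iters;
};

/* Each worker adds `iters` increments to the shared counter under the lock. */
static void *smoke_worker(void *arg)
{
    struct smoke_state *st = arg;
    for (int i = 0; i < st->iters; i++) {
        pthread_mutex_lock(&st->lock);
        st->counter++;
        pthread_mutex_unlock(&st->lock);
    }
    return NULL;
}

/*
 * Hypothetical helper: returns 0 if all threads were created and joined
 * and no increment was lost, 1 if the count is wrong, -1 on setup failure.
 */
int pthread_smoke_test(int nthreads, int iters)
{
    enum { MAX_THREADS = 64 };
    pthread_t tids[MAX_THREADS];
    struct smoke_state st = { .counter = 0, .iters = iters };
    int created = 0;
    int failed = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS || iters < 0)
        return -1;
    if (pthread_mutex_init(&st.lock, NULL) != 0)
        return -1;

    for (; created < nthreads; created++) {
        if (pthread_create(&tids[created], NULL, smoke_worker, &st) != 0) {
            failed = 1;
            break;
        }
    }
    /* Join only the threads that were actually started. */
    for (int i = 0; i < created; i++) {
        if (pthread_join(tids[i], NULL) != 0)
            failed = 1;
    }
    pthread_mutex_destroy(&st.lock);

    if (failed)
        return -1;
    return st.counter == (long)nthreads * iters ? 0 : 1;
}
```

A guest `main` would call, for example, `pthread_smoke_test(8, 10000)` and exit with the result. On a 4-CPU QEMU guest the same call also gives the work-stealing check eight runnable threads to distribute.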

6.3 Validation Scripts (to create)

local/scripts/test-threading-qemu.sh          # Comprehensive threading smoke test
local/scripts/test-futex-requeue-qemu.sh      # REQUEUE-specific test
local/scripts/test-futex-pi-qemu.sh           # PI futex test
local/scripts/test-futex-robust-qemu.sh       # Robust mutex test
local/scripts/test-sched-rt-qemu.sh           # RT scheduling latency test
local/scripts/test-sched-balance-qemu.sh      # Load balancing on multi-vCPU
local/scripts/test-threading-baremetal.sh     # Bare metal multi-threaded stress

7. Estimated Effort

Phase	Duration	New Code	Recovery	Dependencies
Phase 0: Patch Recovery	1–2 weeks	Minimal (rebase 5 patches)	13 patches apply directly	None
Phase 1: Futex Completeness	2–3 weeks	REQUEUE impl + WAKE_OP	PI/robust from P8 patches	Phase 0
Phase 2: SMP Scheduling	3–4 weeks	TLB INVLPG + broadcast opt	Work stealing from P8	Phase 0
Phase 3: RT Scheduling	1–2 weeks	relibc sched_* API	RT dispatch from P5	Phase 0
Phase 4: POSIX Pthread	2–3 weeks	Affinity/naming/guard/clock	Partial from P7 patches	Phase 0, 3
Phase 5: IRQ + NUMA	3–4 weeks	IRQ steering + NUMA allocator	NUMA topology from P9	Phase 0, 2

Total: 12–18 weeks with 1–2 developers when the phases run back to back (the sum of the per-phase durations above). Phase 0 alone recovers the majority of the value in 1–2 weeks.
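
The arithmetic behind the total can be checked with a short sketch. It encodes the Duration and Dependencies columns of the table above and computes two figures: the serial sum, and the longest dependency chain if every phase started as soon as its prerequisites finished. The second figure is an illustration, not a claim made by this plan: it assumes unlimited staffing and takes the dependency column at face value. All identifiers are hypothetical.

```c
enum { NPHASES = 6 };

/* One row of the effort table: duration range plus a bitmask of prerequisites. */
struct phase {
    int min_weeks;
    int max_weeks;
    unsigned deps; /* bit d set: phase d must finish first */
};

struct span {
    int min_weeks;
    int max_weeks;
};

/* Durations and dependencies copied from the table; index = phase number. */
static const struct phase plan[NPHASES] = {
    {1, 2, 0u},                    /* Phase 0: no dependencies   */
    {2, 3, 1u << 0},               /* Phase 1: after Phase 0     */
    {3, 4, 1u << 0},               /* Phase 2: after Phase 0     */
    {1, 2, 1u << 0},               /* Phase 3: after Phase 0     */
    {2, 3, (1u << 0) | (1u << 3)}, /* Phase 4: after Phase 0, 3  */
    {3, 4, (1u << 0) | (1u << 2)}, /* Phase 5: after Phase 0, 2  */
};

/* Sum of all durations: the schedule if phases run strictly one after another. */
struct span serial_total(void)
{
    struct span total = {0, 0};
    for (int i = 0; i < NPHASES; i++) {
        total.min_weeks += plan[i].min_weeks;
        total.max_weeks += plan[i].max_weeks;
    }
    return total;
}

/*
 * Longest dependency chain. Every phase depends only on lower-numbered
 * phases, so index order is already a valid topological order.
 */
struct span critical_path(void)
{
    int fin_min[NPHASES];
    int fin_max[NPHASES];
    struct span longest = {0, 0};

    for (int i = 0; i < NPHASES; i++) {
        int start_min = 0;
        int start_max = 0;
        for (int d = 0; d < i; d++) {
            if (plan[i].deps & (1u << d)) {
                if (fin_min[d] > start_min)
                    start_min = fin_min[d];
                if (fin_max[d] > start_max)
                    start_max = fin_max[d];
            }
        }
        fin_min[i] = start_min + plan[i].min_weeks;
        fin_max[i] = start_max + plan[i].max_weeks;
        if (fin_min[i] > longest.min_weeks)
            longest.min_weeks = fin_min[i];
        if (fin_max[i] > longest.max_weeks)
            longest.max_weeks = fin_max[i];
    }
    return longest;
}
```

`serial_total()` reproduces the 12–18 week figure. `critical_path()` gives 7–10 weeks, driven by the Phase 0, Phase 2, Phase 5 chain. With only 1–2 developers the real schedule would fall between the two.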

8. Integration with Existing Plans

Plan	Relationship
`CONSOLE-TO-KDE-DESKTOP-PLAN.md`	Consumer — Phase 3 (KWin) needs PI futex + RT scheduling; Phase 2 (compositor) needs work stealing
`IRQ-AND-LOWLEVEL-CONTROLLERS-ENHANCEMENT-PLAN.md`	Sibling — IRQ steering (Phase 5.1) belongs to both plans
`DRM-MODERNIZATION-EXECUTION-PLAN.md`	Consumer — GPU worker threads benefit from load balancing + affinity
`IMPLEMENTATION-MASTER-PLAN.md`	Parent — this plan covers the kernel threading substrate
`CPU-DMA-IRQ-MSI-SCHEDULER-FIX-PLAN.md`	Sibling — overlaps on scheduler/IRQ delivery

9. Bottom Line

The Red Bear OS threading stack is functional for basic single-threaded and lightly threaded workloads. SMP boot, context switching, TLB shootdown, and basic futex operations are correct.

The critical problem is that six months of threading enhancement work (the P5–P9 patches) was lost during the local fork migration. The work survives as patch files: per §7, 13 apply directly to the current fork and 5 need a rebase, which makes Phase 0 (Patch Recovery) the single highest-ROI action.

After Phase 0, the remaining gaps are:

Futex REQUEUE/PI/robust — for condvar performance and deadlock prevention
SMP work stealing + load balancing — for multi-core utilization
RT scheduling — for audio/compositor thread priority
POSIX pthread completeness — for application compatibility
IRQ steering + NUMA — for multi-socket performance

The desktop-critical path (KWin responsiveness) requires Phases 0–3. The server-critical path (multi-socket, NUMA) adds Phase 5. Phase 4 (POSIX completeness) benefits all paths but is not desktop-blocking.
