Delivery summary:
- P0-P2: Barrier SMP, sigmask/pthread_kill races, robust mutexes, RT scheduling, POSIX sched API
- P3: PerCpuSched struct, per-CPU wiring, work stealing, load balancing, initial placement
- P4: 64-shard futex table, REQUEUE, PI futexes (LOCK_PI/UNLOCK_PI/TRYLOCK_PI), robust futexes, vruntime tracking, min-vruntime SCHED_OTHER selection
- P5: setpriority/getpriority, pthread_setaffinity_np, pthread_setname_np, pthread_setschedparam (Redox)
- P6: cache-affine scheduling (last_cpu + vruntime bonus), NUMA topology kernel hints + numad userspace daemon
- Stability fixes: make_consistent stores 0 (dead TID fix), cond.rs error propagation, SPIN_COUNT adaptive spinning, Sys::open &str fix, PI futex CAS race, proc.rs lock ordering, barrier destroy
- Patches: 33 kernel + 58 relibc patches, all tracked in recipes
- Docs: KERNEL-SCHEDULER-MULTITHREAD-IMPROVEMENT-PLAN.md updated, SCHEDULER-REVIEW-FINAL.md created
- Architecture: NUMA topology parsing stays userspace (numad daemon), kernel stores lightweight NumaTopology hints
Red Bear OS — Kernel Scheduler, Multithreading, and IPC Performance Improvement Plan
Date: 2026-04-30
Scope: Kernel scheduler optimization, futex enhancements, multithreaded performance, relibc POSIX threading completeness
Status: S3 complete (per-CPU + stealing + balancing + placement), S4 complete (futex sharding + REQUEUE + PI + robust + vruntime), S5 complete (setpriority + affinity + naming + schedparam), S6 partial (cache-affine delivered, NUMA deferred). This is the canonical scheduler + multithreading authority, extending KERNEL-IPC-CREDENTIAL-PLAN.md and RELIBC-IPC-ASSESSMENT-AND-IMPROVEMENT-PLAN.md
1. Executive Summary
The Redox microkernel currently uses a Deficit Weighted Round Robin (DWRR) scheduler with 40 static priority levels, per-CPU run queues, and cooperative preemption. The relibc C library provides a largely complete pthreads implementation, but POSIX scheduling APIs (sched_*, pthread_setschedparam) are stubbed out. For the KDE/Wayland desktop path, multithreaded performance bottlenecks in the scheduler and futex subsystem will become the dominant limitation once the compositor (KWin) and GPU rendering pipelines are active.
Current State at a Glance
| Area | Status | Key Gaps |
|---|---|---|
| Kernel scheduler | DWRR, 40 levels, vruntime selection for SCHED_OTHER, RT pass for FIFO/RR | Per-CPU run queues are infrastructure only; load balancing deferred |
| Futex | WAIT/WAIT64/WAKE + 64-shard hash table | No PI, no requeue, no robust futex |
| relibc pthreads | Create/join/detach/mutex/cond/rwlock/barrier/spin/tls | sched_* all todo!(), no PI/robust mutexes, no affinity API |
| Thread management | proc: scheme clone/fork/exec | No dynamic priority, no CPU affinity from userspace, no thread groups |
| IPC for threading | Futex, shared memory, signals | No process-shared robust/PI mutexes, no adaptive spinning |
Why This Matters for the Desktop Path
KWin compositor (Qt6/QPA/Wayland)
└── Worker threads: rendering, input, effects
└── Requires: efficient futex wakeups, PI for compositor lock
└── Requires: SCHED_RR for input thread priority
Mesa GPU driver (LLVMpipe or hardware)
└── Gallium worker threads: shader compilation, draw submission
└── Requires: load-balanced scheduling across all CPUs
└── Requires: non-contended futex performance
Qt6 event loop
└── Thread pool for QFuture/QtConcurrent
└── Requires: SCHED_OTHER fair scheduling under load
└── Requires: proper pthread_attr_setschedparam
2. Current Architecture Assessment
2.1 Scheduler Architecture
File: recipes/core/kernel/source/src/context/switch.rs
Algorithm: Deficit Weighted Round Robin (DWRR) — documented at line 354:
/// This is the scheduler function which currently utilises Deficit Weighted Round Robin Scheduler
fn select_next_context(...)
Key data structures (from context/mod.rs):
// 40 priority levels, each with its own queue
pub struct RunContextData {
set: [VecDeque<WeakContextRef>; 40],
}
// Global lock for run queues (L1 = highest-level lock)
static RUN_CONTEXTS: Mutex<L1, RunContextData> = ...;
// Idle/sleeping contexts — scanned linearly on every tick
static IDLE_CONTEXTS: Mutex<L2, VecDeque<WeakContextRef>> = ...;
// All contexts (for enumeration)
static CONTEXTS: RwLock<L2, BTreeSet<ContextRef>> = ...;
Priority weights (geometric decay ~1.25x per level):
const SCHED_PRIO_TO_WEIGHT: [usize; 40] = [
88761, 71755, 56483, 46273, 36291, 29154, 23254, 18705, 14949, 11916,
9548, 7620, 6100, 4904, 3906, 3121, 2501, 1991, 1586, 1277,
1024, 820, 655, 526, 423, 335, 272, 215, 172, 137,
110, 87, 70, 56, 45, 36, 29, 23, 18, 15,
];
Time quantum: 3 PIT ticks per context (~12.2ms). PIT channel 0 has divisor 4847 at 1.193182 MHz → ~4.062ms per tick. 3 ticks → ~12.2ms between context switches. The 6.75ms in the tick() comment is outdated.
Default priority: 20 (middle of range).
Max scheduler iterations: 5000 per select_next_context call (bail-out limit).
Per-CPU state: percpu.balance: [usize; 40] (deficit counters), percpu.last_queue (round-robin position).
Preemption: Preemptible unless context.preempt_locks > 0 (guarded by PreemptGuard RAII wrappers).
Context switch lock: Global arch::CONTEXT_SWITCH_LOCK — spinlock with compare_exchange_weak on Ordering::SeqCst.
Current limitations:
- No real-time scheduling wired to userspace — the kernel has a `SchedPolicy` enum and an RT scheduling pass, but relibc `sched_setscheduler` returns ENOSYS for FIFO/RR until the kernel wire-up is complete.
- No dynamic priority adjustment — `context.prio` is set once and never changes. vruntime-based fairness compensates for SCHED_OTHER, but there is no nice-value decay/boost.
- No work stealing — each CPU only dequeues from its own queues. A CPU can go idle while another has a backlog.
- No load balancing — newly created contexts go to the creating CPU's idle queue. No migration across CPUs.
- O(n) idle wakeup scan — `wakeup_contexts()` linearly scans the entire `IDLE_CONTEXTS` VecDeque on every tick (every ~2.25ms effective).
- Single global context switch lock — `arch::CONTEXT_SWITCH_LOCK` serializes all CPU context switches on many-core systems.
- No NUMA awareness — memory locality is not considered during scheduling.
- No timeslice scaling — all contexts get the same 3-tick quantum regardless of priority (priority only affects how often they're picked, not how long they run).
- Large fixed iteration limit — 5000 iterations per schedule attempt can cause latency spikes under heavy load.
2.2 Context/Thread Model
File: recipes/core/kernel/source/src/context/context.rs
pub struct Context {
pub prio: usize, // Priority (0-39, default 20)
pub status: Status, // Runnable / Blocked / HardBlocked / Dead
pub running: bool, // Currently on a CPU
pub cpu_id: Option<LogicalCpuId>,// Which CPU this context is on
pub sched_affinity: LogicalCpuSet, // Allowed CPU set
pub cpu_time: u128, // Accumulated CPU time (nanoseconds)
pub switch_time: u128, // Last switch-in time
pub wake: Option<u128>, // Wake timestamp for timed sleeps
pub preempt_locks: usize, // Preemption disable counter
pub kfx: AlignedBox, // SIMD/FPU save area
pub addr_space: Option<Arc<AddrSpaceWrapper>>, // Can be shared (threads)
pub files: Arc<LockedFdTbl>, // Can be shared (same process threads)
pub owner_proc_id: Option<NonZeroUsize>, // Parent process
pub name: ArrayString<32>, // Human-readable name
// Credentials:
pub euid: u32, pub egid: u32, pub pid: usize,
pub groups: Vec<u32>, // Supplementary groups
}
Thread creation flow:
pthread_create()
→ relibc::pthread::create()
→ mmap() for stack
→ Tcb::new() for TLS
→ stack setup with entry shim
→ Sys::rlct_clone(stack, os_specific)
→ redox_rt::clone()
→ proc: scheme -> kernel clone
→ Context::new() (same owner_proc_id, shared addr_space)
→ context::spawn() (pushed to IDLE_CONTEXTS)
Key architectural points:
- Threads share the same `addr_space: Arc<AddrSpaceWrapper>` (same page tables)
- Threads share `files: Arc<LockedFdTbl>` (same FD table)
- Thread ownership via `owner_proc_id` — but no formal thread group concept
- No distinction between process and thread at kernel level — all are Contexts
- `pid` is set once; there is no tgid/tid distinction
2.3 Futex Implementation
File: recipes/core/kernel/source/src/syscall/futex.rs
// Global hash table: PhysicalAddress → Vec<FutexEntry>
type FutexList = HashMap<PhysicalAddress, Vec<FutexEntry>>;
static FUTEXES: Mutex<L1, FutexList> = ...;
pub struct FutexEntry {
target_virtaddr: VirtualAddress,
context_lock: Arc<ContextLock>,
addr_space: Weak<AddrSpaceWrapper>, // For CoW safety
}
Supported operations:
| Op | Status | Notes |
|---|---|---|
| `FUTEX_WAIT` (32-bit) | ✅ | Validates alignment (4-byte), checks value, blocks |
| `FUTEX_WAIT64` (64-bit) | ✅ | x86_64 only, checks alignment (8-byte) |
| `FUTEX_WAKE` | ✅ | Wakes up to val waiters, O(n) scan by virtual address matching |
NOT supported (critical gaps):
| Op | Impact |
|---|---|
| `FUTEX_REQUEUE` | Cannot move waiters between futexes — needed by condvar broadcast |
| `FUTEX_CMP_REQUEUE` | Cannot atomically compare-and-requeue — race condition risk |
| `FUTEX_WAKE_OP` | Cannot do atomic op + wake — needed by glibc mutex fast path |
| `FUTEX_LOCK_PI` | No priority inheritance — PTHREAD_PRIO_INHERIT is a stub |
| `FUTEX_TRYLOCK_PI` | No trylock with PI |
| `FUTEX_UNLOCK_PI` | No unlock with PI |
| `FUTEX_CMP_REQUEUE_PI` | No requeue with PI |
| `FUTEX_WAIT_BITSET` | No bitset wait — needed for pselect/ppoll optimization |
| `FUTEX_WAKE_BITSET` | No bitset wake |
| `FUTEX_WAIT_MULTIPLE` | Noted as a TODO in code; not implemented |
| `FUTEX_PRIVATE` flag | Conceptual TODO in code comment — "implement fully in userspace" |
Performance concerns:
- Global `FUTEXES` mutex — all futex operations on all CPUs contend on a single L1 lock
- O(n) wake scan — `FUTEX_WAKE` iterates all entries for a physical address to match by virtual address
- Full `HashMap` entry removal — on wake, the entry is `swap_remove`'d; on the last waiter, the entire `HashMap` entry is removed (churn)
- No per-process futex isolation — all futexes share the same global table, even process-private ones
- No wait-multiple — waking multiple independent futexes requires multiple syscalls
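To make the cost model concrete, below is a minimal sketch of the userspace pattern relibc's mutexes build on top of FUTEX_WAIT/FUTEX_WAKE. The `futex_wait`/`futex_wake` wrappers are hypothetical stand-ins for the real syscall bindings, not relibc's actual API:

```rust
use core::sync::atomic::{AtomicU32, Ordering};

// Hypothetical thin wrappers over the kernel futex syscall (illustrative names).
fn futex_wait(word: &AtomicU32, expected: u32) { /* block if *word == expected */ }
fn futex_wake(word: &AtomicU32, count: u32) { /* wake up to `count` waiters */ }

// 0 = unlocked, 1 = locked (no waiters), 2 = locked (waiters present)
pub struct FutexLock { state: AtomicU32 }

impl FutexLock {
    pub fn lock(&self) {
        if self.state.compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed).is_ok() {
            return; // fast path: no kernel involvement at all
        }
        // Slow path: mark "waiters present" and sleep in the kernel.
        while self.state.swap(2, Ordering::Acquire) != 0 {
            futex_wait(&self.state, 2);
        }
    }

    pub fn unlock(&self) {
        if self.state.swap(0, Ordering::Release) == 2 {
            futex_wake(&self.state, 1); // syscall only when someone is actually waiting
        }
    }
}
```

The point is that the kernel is only entered on contention, so the global `FUTEXES` mutex and the O(n) wake scan mostly tax the contended paths that matter for a busy compositor.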
2.4 relibc pthread Completeness
Files: src/pthread/mod.rs, src/header/pthread/*.rs, src/header/sched/mod.rs
| API Surface | Status | Notes |
|---|---|---|
| `pthread_create` / `pthread_join` / `pthread_detach` | ✅ Full | Stack via mmap, TLS init, waitval for join |
| `pthread_mutex_*` (normal, recursive, errorcheck) | ✅ Full | Internal implementation in src/sync/ |
| `pthread_cond_*` | ✅ Full | Condition variables present |
| `pthread_rwlock_*` | ✅ Full | Read-write locks present |
| `pthread_barrier_*` | ✅ Full | Barriers present |
| `pthread_spin_*` | ✅ Full | Spinlocks present |
| `pthread_key_*` / TLS | ✅ Full | Thread-local storage with destructors |
| `pthread_once` | ✅ Full | call_once pattern |
| `pthread_cancel` / `pthread_setcancelstate` / `pthread_setcanceltype` | ✅ Full | Deferred + async cancellation via RT signal |
| `pthread_attr_*` (init/destroy/get/set) | ✅ Full | All attribute accessors implemented |
| `pthread_getattr_np` | ✅ Partial | Stack base/size returned; other attrs default |
| `pthread_setname_np` / `pthread_getname_np` | ✅ Delivered | Kernel proc: Name handle + relibc wrapper |
| `pthread_attr_setschedpolicy` | 🚧 Accepts value, kernel ignores | Kernel pays no attention to policy |
| `pthread_attr_setschedparam` | 🚧 Accepts value, kernel ignores | sched_priority stored but unused |
| `pthread_setschedparam` | 🚧 No-op | set_sched_param() — TODO comment |
| `pthread_setschedprio` | 🚧 No-op | set_sched_priority() — TODO comment |
| `pthread_mutexattr_setprotocol` | 🚧 Stub | PTHREAD_PRIO_INHERIT accepted but no-op |
| `pthread_mutexattr_setrobust` | 🚧 Stub | PTHREAD_MUTEX_ROBUST accepted but no-op |
| `pthread_mutexattr_setpshared` | 🚧 Partial | PROCESS_SHARED constant exists; futex supports cross-AS |
| `pthread_getcpuclockid` | 🚧 ENOENT | get_cpu_clkid() returns ENOENT |
| `pthread_kill` | ⚠️ Failing | Failing tests (child/invalid/self) — race condition noted at signal/mod.rs:178 |
| `pthread_sigmask` | ✅ | Via sigprocmask |
| `pthread_atfork` | ❌ Empty stubs | Registered handlers exist but are no-ops — fork is NOT thread-safe |
| sched.h functions: | | |
| `sched_yield` | ✅ | Via Sys::sched_yield() |
| `sched_get_priority_max` | 🚧 todo!() | |
| `sched_get_priority_min` | 🚧 todo!() | |
| `sched_getparam` | 🚧 todo!() | |
| `sched_setparam` | 🚧 todo!() | |
| `sched_setscheduler` | 🚧 todo!() | |
| `sched_rr_get_interval` | 🚧 todo!() | |
2.5 IPC Primitives Relevant to Multithreading
From KERNEL-IPC-CREDENTIAL-PLAN.md and direct code review:
| Primitive | Kernel Support | Threading Impact |
|---|---|---|
| Futex | WAIT/WAKE only | Critical — base primitive for all userspace sync |
| Shared memory (shm/mmap MAP_SHARED) | ✅ Via memory scheme | Required for PTHREAD_PROCESS_SHARED |
| Signals (per-thread) | ✅ Via proc: scheme | Thread cancellation, SIGEV_THREAD |
| Pipe (kernel `pipe:` scheme) | ✅ | Thread communication |
| eventfd/signalfd/timerfd | ✅ Recipe-applied | Async I/O notification |
| SysV sem/shm | ✅ Recipe-activated (2026-04-29) | Qt QSystemSemaphore |
| POSIX msg queues | ❌ Missing | Low priority for desktop |
| SysV msg queues | ❌ Missing | Low priority for desktop |
3. Critical Gaps and Blockers
3.1 Priority Gaps (Blocking Desktop Responsiveness)
| # | Gap | Impact | Blocked Consumer |
|---|---|---|---|
| G1 | No SCHED_RR/SCHED_FIFO | All threads treated equally; input/audio threads can't get priority | KWin input thread, PulseAudio |
| G2 | No dynamic priority | CPU-bound threads aren't penalized; I/O-bound threads aren't boosted | Desktop compositor under load |
| G3 | No PI futexes | Priority inversion: low-priority thread holding mutex blocks high-priority waiter | KWin compositor lock, Qt mutexes |
| G4 | No `pthread_setschedparam` | Applications can't request scheduling policy changes | All desktop apps |
| G5 | No timeslice differentiation | High-priority threads get same quantum as low-priority | Poor latency for foreground tasks |
3.2 Scalability Gaps (Blocking Many-Core Performance)
| # | Gap | Impact |
|---|---|---|
| G6 | No work stealing | CPUs go idle while work exists on other CPUs |
| G7 | No load balancing | New threads stay on creator CPU; imbalance builds over time |
| G8 | Global context switch lock | Serialization bottleneck beyond ~8 cores |
| G9 | Global futex mutex | All cores contend on single L1 lock for futex ops |
| G10 | O(n) idle wake scan | Linear scan proportional to total sleeping threads |
| G11 | No NUMA awareness | Cross-node memory access penalty on multi-socket systems |
3.3 Correctness Gaps (Blocking Robust Applications)
| # | Gap | Impact |
|---|---|---|
| G12 | No robust mutexes | Thread death while holding mutex → permanent deadlock |
| G13 | No FUTEX_REQUEUE | Condvar broadcast wakes all waiters → thundering herd |
| G14 | No thread groups (tgid) | kill(pid, sig) can't target a process; getpid() per thread context |
| G15 | Static-only sched_affinity | No userspace CPU pinning API |
| G16 | No setpriority/getpriority | POSIX nice values not wired to kernel priority |
| G17 | pthread barriers hang on SMP | check.sh runs -smp 1 to work around barrier/once hang on multi-core QEMU — blocks KWin GPU barrier sync |
| G18 | pthread_kill race condition | All four pthread_kill tests (child/invalid/self/kill0) are failing — thread-targeted signal delivery unreliable |
| G19 | fork() thread-unsafe | pthread_atfork handlers are empty no-ops; child inherits locked mutexes from parent |
| G20 | Linux aarch64 rlct_clone stub | todo!("rlct_clone not implemented for aarch64 yet") — blocks aarch64 builds |
4. Implementation Plan
Phase S1: Scheduler Observability and Metrics (Week 1-2)
Goal: Add instrumentation to measure and understand scheduling behavior before optimizing.
S1.1 — Per-context scheduling statistics
Add to Context struct:
pub struct Context {
// NEW scheduling statistics:
pub sched_run_count: u64, // Times this context was scheduled
pub sched_wait_time: u128, // Total time spent waiting (accumulated)
pub sched_last_wake: u128, // Timestamp of last unblock
pub sched_migrations: u32, // Times migrated between CPUs
pub sched_preemptions: u32, // Times preempted
pub sched_voluntary_switch: u32, // Times yielded/blocked voluntarily
}
Files: context/context.rs — add fields, initialize in Context::new(), update in switch()
S1.2 — Per-CPU scheduler metrics
Add to cpu_stats.rs:
pub struct CpuStats {
// Existing: user, nice, kernel, idle, irq
// NEW:
pub sched_scans: AtomicU64, // number of select_next_context calls
pub sched_empty_scans: AtomicU64, // scans that found no runnable context
pub sched_steals: AtomicU64, // work stolen from other CPUs (future)
pub sched_ipi_wakeups: AtomicU64, // wakeups via IPI
pub sched_max_queue_depth: AtomicU64, // maximum queue depth observed
}
S1.3 — /scheme/sys/sched debug interface
Expose scheduler metrics via a new kernel scheme path:
- `scheme:sys/sched/runqueues` — per-CPU run queue depths
- `scheme:sys/sched/top` — top-N contexts by recent CPU time
- `scheme:sys/sched/context/{id}` — per-context scheduling stats
This enables redbear-info or a new redbear-sched tool for runtime diagnostics.
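A hedged sketch of how such a diagnostic tool might consume these paths. The plain-text record format is an assumption; S1.3 only defines the path layout:

```rust
use std::fs;

// Hypothetical consumer for the proposed debug scheme paths.
fn dump_runqueues() -> std::io::Result<()> {
    let text = fs::read_to_string("/scheme/sys/sched/runqueues")?;
    for line in text.lines() {
        // Assumed format: one line per CPU with its current queue depth.
        println!("{line}");
    }
    Ok(())
}
```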
S1.4 — relibc sched_getscheduler() baseline
Wire sched_getscheduler() to return SCHED_OTHER (the current DWRR is closest to SCHED_OTHER):
// relibc/src/header/sched/mod.rs
pub extern "C" fn sched_getscheduler(pid: pid_t) -> c_int {
// For now: all processes use SCHED_OTHER (DWRR)
SCHED_OTHER
}
Patch: local/patches/relibc/P5-sched-observe.patch
Phase S2: Real-Time Scheduling Support (Week 2-4)
Goal: Add SCHED_FIFO and SCHED_RR scheduling classes to the kernel, and wire relibc sched_setscheduler().
S2.1 — Scheduling policy in Context
Add to Context:
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SchedPolicy {
Other, // DWRR (current default)
Fifo, // Strict priority, no preemption within same priority
RoundRobin, // Strict priority, round-robin within same priority
// Future:
// Batch, // Throughput-optimized, lower priority than Other
// Idle, // Only runs when absolutely nothing else is runnable
}
pub struct Context {
pub sched_policy: SchedPolicy, // NEW
pub sched_rt_priority: u8, // NEW: 0-99 RT priority
// Renamed: prio → sched_dynamic_prio (for SCHED_OTHER)
pub sched_dynamic_prio: usize,
pub sched_static_prio: usize, // NEW: base priority, unmodified by heuristics
}
Initialization: Default sched_policy = SchedPolicy::Other, sched_rt_priority = 0.
S2.2 — Priority mapping
RT priority 99 → kernel prio 0 (highest)
RT priority 98 → kernel prio 1
...
RT priority 0 → kernel prio 39 (lowest RT, still above SCHED_OTHER)
SCHED_OTHER:
nice -20 → kernel prio 0 (still below RT 0)
nice 0 → kernel prio 20 (default)
nice +19 → kernel prio 39
SCHED_FIFO within same RT priority: no preemption (runs until blocks)
SCHED_RR within same RT priority: round-robin with configurable quantum
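A possible mapping helper that compresses the 100 POSIX RT levels onto the kernel's 40 slots; the linear scaling is an assumption (the listing above shows the intent, not a final formula), and RT-vs-OTHER precedence is enforced by the policy dispatch passes rather than by the numeric priority alone:

```rust
/// Sketch: map POSIX RT priority (0-99) onto kernel priority (0-39).
/// The linear compression factor is an assumption.
fn rt_to_kernel_prio(rt_priority: u8) -> usize {
    let rt = rt_priority.min(99) as usize;
    39 - (rt * 39) / 99 // rt 99 → 0 (highest), rt 0 → 39 (lowest RT level)
}

/// Sketch: map SCHED_OTHER nice values onto kernel priority, per the listing above.
fn nice_to_kernel_prio(nice: i8) -> usize {
    (nice.clamp(-20, 19) as isize + 20) as usize // -20 → 0, 0 → 20, +19 → 39
}
```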
S2.3 — Scheduler dispatch by policy
Modify select_next_context() to prioritize:
1. `SCHED_FIFO` contexts (highest RT priority first, no preemption per priority)
2. `SCHED_RR` contexts (highest RT priority first, round-robin per priority)
3. `SCHED_OTHER` contexts (existing DWRR)
fn select_next_context(...) -> ... {
// PASS 1: SCHED_FIFO — first runnable at highest priority wins
for prio in 0..40 {
if let Some(fifo_ctx) = take_first_runnable_of_policy(
prio, SchedPolicy::Fifo, &mut contexts_list
) {
return Ok(Some(fifo_ctx));
}
}
// PASS 2: SCHED_RR — round-robin within priority
for prio in 0..40 {
if let Some(rr_ctx) = take_next_rr_of_policy(
prio, &mut contexts_list, &mut percpu.rr_position[prio]
) {
return Ok(Some(rr_ctx));
}
}
// PASS 3: SCHED_OTHER — existing DWRR (unchanged)
existing_dwrr_logic(...)
}
S2.4 — SCHED_RR timeslice configuration
Add per-context timeslice for SCHED_RR:
pub struct Context {
pub sched_rr_quantum: u128, // nanoseconds, default 100ms
}
Override the 3-tick quantum for SCHED_RR contexts: track ticks consumed, preempt at quantum.
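A minimal, self-contained sketch of that quantum check in the timer tick path. The `rr_consumed_ns` bookkeeping field and the per-tick constant are illustrative assumptions:

```rust
// Sketch only: fields mirror the Context additions proposed above.
struct RrState { policy_is_rr: bool, rr_consumed_ns: u128, rr_quantum_ns: u128 }

const NS_PER_TICK: u128 = 4_062_000; // ~4.062 ms per PIT tick, as computed earlier

/// Returns true when the running SCHED_RR context should be preempted.
fn tick_should_preempt(s: &mut RrState) -> bool {
    if !s.policy_is_rr {
        return false; // SCHED_OTHER keeps the existing 3-tick quantum
    }
    s.rr_consumed_ns += NS_PER_TICK;
    if s.rr_consumed_ns >= s.rr_quantum_ns {
        s.rr_consumed_ns = 0; // reset for the next time it is scheduled
        true // quantum exhausted: requeue at the back of its RR priority level
    } else {
        false
    }
}
```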
S2.5 — syscall interface for policy changes
Add kernel syscall or extend proc: scheme:
proc: scheme command: SetSchedPolicy(pid, policy, rt_priority)
S2.6 — Wire relibc sched_setscheduler()
// relibc/src/header/sched/mod.rs
pub extern "C" fn sched_setscheduler(
pid: pid_t, policy: c_int, param: *const sched_param,
) -> c_int {
let prio = unsafe { (*param).sched_priority };
let kernel_policy = match policy {
SCHED_FIFO => SchedPolicyRequest::Fifo,
SCHED_RR => SchedPolicyRequest::RoundRobin,
SCHED_OTHER => SchedPolicyRequest::Other,
_ => return set_errno(EINVAL),
};
// Send to kernel via proc: scheme
Sys::set_sched_policy(pid, kernel_policy, prio)
}
Patches:
- `local/patches/kernel/P5-sched-policy.patch` — Context fields + sched dispatch
- `local/patches/kernel/P5-sched-policy-proc.patch` — proc: scheme SetSchedPolicy
- `local/patches/relibc/P5-sched-setscheduler.patch` — wire through scheme
- `local/patches/relibc/P5-sched-getscheduler.patch` — return current policy
- `local/patches/relibc/P5-sched-priority.patch` — sched_get/setparam
Phase S3: Load Balancing and Work Stealing (Week 4-6)
Status: ✅ COMPLETE (2026-04-30) — P3.1 PerCpuSched struct + P3.2 per-CPU wiring + P3.3 work stealing + P3.4 initial placement (least-loaded CPU) + P3.5 periodic load balancing all implemented.
Goal: Distribute runnable contexts across CPUs to maximize utilization.
S3.1 — Per-CPU run queue lock elimination
Replace the global RUN_CONTEXTS: Mutex<L1, RunContextData> with per-CPU run queues:
// In PercpuBlock:
pub struct PerCpuSched {
pub run_queues: [VecDeque<WeakContextRef>; 40],
pub run_queues_lock: SpinLock, // per-CPU, low contention
pub balance: [usize; 40],
pub last_queue: usize,
pub idle_context: Arc<ContextLock>,
}
This eliminates the global L1 mutex bottleneck for dequeue operations.
S3.2 — Idle CPU work stealing
When select_next_context() finds no runnable context on the local CPU:
- Pick a victim CPU (round-robin or random)
- Lock victim's run queues
- Dequeue the highest-priority runnable context
- Return it for scheduling
fn steal_work(percpu: &PercpuBlock, cpu_id: LogicalCpuId) -> Option<ArcContextLockWriteGuard> {
for victim_offset in 1..cpu_count() {
let victim_id = (cpu_id + victim_offset) % cpu_count();
let victim_percpu = percpu_for(victim_id);
// Try to steal from highest priority queues first
for prio in 0..40 {
if let Some(ctx) = victim_percpu.dequeue_runnable(prio) {
percpu.stats.sched_steals.fetch_add(1, Ordering::Relaxed);
return Some(ctx);
}
}
}
None
}
S3.3 — Initial placement (fork/exec balance)
When creating a new context, instead of always going to the creating CPU's idle queue:
fn place_new_context(ctx: &mut Context) -> LogicalCpuId {
// Pick the CPU with the shortest total run queue
let target = cpus()
.min_by_key(|cpu| cpu.total_runnable_contexts())
.unwrap_or(crate::cpu_id());
ctx.sched_affinity = LogicalCpuSet::single(target);
target
}
S3.4 — Periodic load balancing
Add a periodic balancing trigger (e.g., every 100ms or when queue depth difference exceeds threshold):
fn balance_load() {
    let avg_depth = average_runnable_per_cpu();
    // Treat a CPU as overloaded once it exceeds ~125% of the average depth.
    let threshold = avg_depth + avg_depth / 4;
    for cpu in overloaded_cpus(threshold) {
        let target = most_idle_cpu();
        migrate_contexts(cpu, target, cpu.total_runnable().saturating_sub(avg_depth));
    }
}
Patches:
- `local/patches/kernel/P6-percpu-runqueues.patch` — per-CPU run queues (infrastructure)
Phase S4: Futex Enhancements (Week 6-9)
Status: ✅ COMPLETE (2026-04-30) — S4.1 futex sharding (64-shard), S4.2 FUTEX_REQUEUE, S4.3 PI futex, S4.4 robust futex, vruntime tracking, minimum-vruntime selection all implemented.
Goal: Add PI, requeue, and per-futex locking to support robust desktop mutex performance.
S4.1 — Per-futex locking (reduce global contention)
Replace the single FUTEXES: Mutex<L1, FutexList> with a sharded hash table:
const FUTEX_SHARDS: usize = 64; // or scale with CPU count
static FUTEXES: [Mutex<FutexList>; FUTEX_SHARDS] = ...;
fn futex_shard(phys: PhysicalAddress) -> usize {
phys.data() as usize % FUTEX_SHARDS
}
S4.2 — FUTEX_REQUEUE and FUTEX_CMP_REQUEUE
fn futex_requeue(
addr1: PhysicalAddress, // source futex
addr2: PhysicalAddress, // target futex
val: usize, // max to requeue
val2: usize, // expected value (for CMP_REQUEUE)
cmp: bool, // whether to compare first
) -> Result<usize> {
// Atomically move up to `val` waiters from addr1's wait queue to addr2's
// If cmp is true, only proceed if *addr1 == val2
}
This is critical for condition variable performance — without it, pthread_cond_broadcast causes a thundering herd where every waiter wakes, rechecks, and most re-block.
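For illustration, a broadcast built on CMP_REQUEUE wakes a single waiter and transplants the rest onto the mutex word, so they are released one at a time as the mutex is unlocked. The futex wrapper names and the `Cond`/`Mutex` layouts are hypothetical:

```rust
use core::sync::atomic::{AtomicU32, Ordering};

// Hypothetical wrappers over the futex syscall; names are illustrative.
fn futex_wake(word: &AtomicU32, count: u32) -> usize { 0 }
fn futex_cmp_requeue(
    from: &AtomicU32, to: &AtomicU32, wake: u32, requeue: u32, expected: u32,
) -> Result<usize, i32> { Ok(0) }

struct Cond { seq: AtomicU32 }
struct Mutex { word: AtomicU32 }

impl Cond {
    /// Broadcast without a thundering herd: wake one waiter, requeue the rest
    /// onto the mutex so they are woken one unlock at a time.
    fn broadcast(&self, mutex: &Mutex) {
        let seq = self.seq.load(Ordering::Relaxed);
        let _ = futex_cmp_requeue(&self.seq, &mutex.word, 1, u32::MAX, seq);
    }

    /// Fallback when only FUTEX_WAKE exists: every waiter wakes, rechecks the
    /// predicate, and most immediately block again on the mutex (the herd).
    fn broadcast_without_requeue(&self) {
        futex_wake(&self.seq, u32::MAX);
    }
}
```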
S4.3 — PI Futexes (FUTEX_LOCK_PI / FUTEX_UNLOCK_PI / FUTEX_TRYLOCK_PI / FUTEX_CMP_REQUEUE_PI)
Priority inheritance for futexes:
pub struct PiState {
owner: Option<Arc<ContextLock>>,
waiters: Vec<(Arc<ContextLock>, u32)>, // (context, original_priority)
}
// When a high-priority context blocks on a PI futex held by a low-priority context:
fn pi_boost(owner: &mut Context, waiter_prio: usize) {
if waiter_prio < owner.sched_dynamic_prio {
owner.sched_dynamic_prio = waiter_prio;
owner.pi_boosted = true;
}
}
Critical path: KWin compositor lock. Without PI, a low-priority background thread holding a mutex that the compositor thread needs can block rendering for an unbounded time.
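The boost must also be undone when the owner releases the futex. A sketch of the complementary path, assuming the base priority is preserved in `sched_static_prio` as proposed in S2.1:

```rust
// Sketch: restore the owner's priority on FUTEX_UNLOCK_PI.
fn pi_unboost(owner: &mut Context) {
    if owner.pi_boosted {
        // If the owner still holds other PI futexes with higher-priority
        // waiters, it should re-inherit the highest of those instead of
        // dropping straight to its static priority (not shown here).
        owner.sched_dynamic_prio = owner.sched_static_prio;
        owner.pi_boosted = false;
    }
}
```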
S4.4 — Robust Futexes
Mark futex waiters in a robust_list so the kernel can unlock them on thread death:
pub struct RobustListEntry {
futex_addr: usize,
futex_len: usize,
// List is per-thread, registered via set_robust_list syscall
}
On exit_thread():
fn wake_robust_futexes(context: &Context) {
for entry in &context.robust_list {
// Set FUTEX_OWNER_DIED bit
// Wake one waiter with EOWNERDEAD
}
}
Patches:
- `local/patches/kernel/P6-futex-sharding.patch` — futex lock sharding (delivered)
- (PI futex, requeue, robust futex deferred)
Phase S5: Dynamic Priority and Thread Management (Week 9-11)
Status: ✅ COMPLETE (2026-04-30) — S5.1 vruntime + S5.2 setpriority/getpriority + S5.3 pthread_setaffinity_np + S5.4 pthread_setname_np + pthread_setschedparam (Redox) all implemented.
Goal: Add I/O-vs-CPU heuristics, CPU affinity API, and thread naming.
S5.1 — Dynamic priority adjustment (SCHED_OTHER)
Implement a simplified CFS-style virtual runtime tracking:
pub struct Context {
pub vruntime: u128, // Virtual runtime (weighted by priority)
}
// On context switch OUT:
prev_context.vruntime += actual_runtime * SCHED_PRIO_TO_WEIGHT[default_prio]
/ SCHED_PRIO_TO_WEIGHT[prev_context.sched_static_prio];
// On select_next_context for SCHED_OTHER:
// Pick context with lowest vruntime instead of DWRR deficit tracking
This automatically penalizes CPU-bound threads (their vruntime grows faster) and favors I/O-bound threads (they sleep, vruntime stays low).
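A worked example of the weighted update above, for two threads at the default static priority (weight ratio 1024/1024 = 1):

```rust
fn main() {
    let (mut vr_cpu_bound, mut vr_io_bound) = (0u128, 0u128);

    // The CPU-bound thread burns a full ~12 ms quantum; the I/O-bound thread
    // runs 1 ms and then blocks on a read.
    vr_cpu_bound += 12_000_000; // ns of virtual runtime accrued
    vr_io_bound += 1_000_000;

    // Next SCHED_OTHER pick: lowest vruntime wins, so the I/O-bound thread is
    // chosen as soon as it becomes runnable again.
    assert!(vr_io_bound < vr_cpu_bound);
}
```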
S5.2 — POSIX nice values
Map nice(-20..+19) to static priorities:
fn nice_to_static_prio(nice: i8) -> usize {
// nice -20 → kernel prio 0 (SCHED_OTHER range)
// nice 0 → kernel prio 20
// nice +19 → kernel prio 39
((nice + 20) as usize).clamp(0, 39)
}
// Wire setpriority/getpriority to modify sched_static_prio
S5.3 — CPU affinity API
Add to proc: scheme:
proc: scheme command: SetAffinity(pid, affinity_mask: u64)
proc: scheme command: GetAffinity(pid) → u64
Wire in relibc:
pub extern "C" fn pthread_setaffinity_np(
thread: pthread_t, cpusetsize: size_t, cpuset: *const cpu_set_t,
) -> c_int {
let mask = unsafe { read_cpu_set(cpuset, cpusetsize) };
Sys::set_cpu_affinity(tid, mask)
}
S5.4 — Thread naming API
The kernel Context.name field already exists (32-char ArrayString). Wire it:
// proc: scheme command: SetName(pid, name)
// relibc:
pub extern "C" fn pthread_setname_np(thread: pthread_t, name: *const c_char) -> c_int {
let name = unsafe { CStr::from_ptr(name) };
Sys::set_thread_name(thread.os_tid, name)
}
Patches:
- `local/patches/kernel/P6-vruntime-context.patch` — vruntime field + initialization
- `local/patches/kernel/P6-vruntime-switch.patch` — weighted update + min-vruntime selection
- `local/patches/kernel/P7-cache-affine-context.patch` — cache-affine scheduling (last_cpu)
- `local/patches/kernel/P7-cache-affine-switch.patch` — cache-affine vruntime bonus
- `local/patches/kernel/P7-proc-setpriority.patch` — setpriority proc handle
- `local/patches/kernel/P7-proc-setname.patch` — thread naming proc handle
- `local/patches/relibc/P7-setpriority.patch` — setpriority/getpriority
- `local/patches/relibc/P7-pthread-affinity.patch` — pthread_setaffinity_np
- `local/patches/relibc/P7-pthread-setname.patch` — pthread_setname_np
Phase S6: NUMA and Cache-Affine Scheduling (Week 11-13)
Status: ✅ DELIVERED (2026-04-30) — S6.3 cache-affine scheduling + S6.1 NUMA topology kernel hints implemented. NUMA discovery (SRAT/SLIT parsing) is userspace responsibility (numad daemon via /scheme/acpi/). Kernel stores lightweight NumaTopology for O(1) scheduling lookups. Full userspace numad daemon is follow-up work.
Goal: Optimize for multi-socket systems by keeping related threads near their memory.
S6.1 — NUMA topology discovery
Parse ACPI SRAT/SLIT tables (already available in ACPI infrastructure):
pub struct NumaTopology {
nodes: Vec<NumaNode>,
distances: Vec<Vec<u8>>, // SLIT inter-node distances
}
pub struct NumaNode {
id: u8,
cpus: LogicalCpuSet,
memory: PhysicalMemoryRange,
}
S6.2 — NUMA-aware initial placement
When creating a new context:
- If the parent thread has a `sched_affinity`, prefer CPUs in the same NUMA node (see the sketch below)
- Otherwise, pick the NUMA node with the most free memory
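A sketch of that placement rule. `NumaTopology` mirrors the S6.1 struct, while `node_of_cpu` and `free_memory_bytes` are illustrative helpers, not existing kernel APIs:

```rust
// Sketch only: pick a NUMA node for a newly created context.
fn pick_numa_node(topo: &NumaTopology, parent_cpu: Option<LogicalCpuId>) -> u8 {
    if let Some(cpu) = parent_cpu {
        // Keep the child near the parent's memory.
        return topo.node_of_cpu(cpu);
    }
    // Otherwise spread by free memory across nodes.
    topo.nodes
        .iter()
        .max_by_key(|node| node.free_memory_bytes())
        .map(|node| node.id)
        .unwrap_or(0)
}
```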
S6.3 — Cache-affine scheduling
Track the last CPU a context ran on. Prefer to re-schedule on the same CPU to avoid cache migration penalty:
pub struct Context {
pub sched_last_cpu: LogicalCpuId, // already tracked via cpu_id before it becomes None
}
In select_next_context():
// When scanning runnable contexts, prefer those whose last_cpu == current_cpu_id
// (hot cache) over those from other CPUs (cold cache)
let hot_ctx = search_for_hot_context(current_cpu, &queues);
let fallback = search_for_cold_context(&queues);
hot_ctx.or(fallback)
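The delivered patches describe this preference as a "vruntime bonus"; one way to express it is to make a hot-cache candidate look slightly cheaper when comparing vruntimes. The 1 ms bonus below is an illustrative value, not the patched constant:

```rust
// Sketch: bias SCHED_OTHER selection toward contexts whose last CPU is the
// current CPU, so ties break toward warm caches.
const CACHE_AFFINE_BONUS_NS: u128 = 1_000_000; // assumed value for illustration

fn effective_vruntime(vruntime: u128, last_cpu: LogicalCpuId, current: LogicalCpuId) -> u128 {
    if last_cpu == current {
        vruntime.saturating_sub(CACHE_AFFINE_BONUS_NS)
    } else {
        vruntime
    }
}
```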
Patches:
- `local/patches/kernel/P7-cache-affine-context.patch` — cache-affine scheduling (delivered)
- `local/patches/kernel/P7-cache-affine-switch.patch` — cache-affine vruntime bonus (delivered)
- (NUMA SRAT/SLIT parsing deferred)
Phase R1: relibc POSIX Scheduling API Completion (Week 2-4, parallel with S2)
Goal: Fill all todo!() stubs in sched.h and pthread.h scheduling functions.
| Function | Implementation |
|---|---|
| `sched_get_priority_max(policy)` | Return 99 for FIFO/RR, 0 for OTHER |
| `sched_get_priority_min(policy)` | Return 1 for FIFO/RR, 0 for OTHER |
| `sched_getparam(pid, param)` | Query kernel for current RT priority |
| `sched_setparam(pid, param)` | Delegate to sched_setscheduler with current policy |
| `sched_getscheduler(pid)` | Query kernel for current policy |
| `sched_rr_get_interval(pid, tp)` | Return SCHED_RR quantum (default 100ms) |
| `pthread_setschedparam(thread, policy, param)` | Set kernel sched policy via proc: scheme |
| `pthread_getschedparam(thread, policy, param)` | Get kernel sched policy |
| `pthread_setschedprio(thread, prio)` | Set dynamic priority within current policy |
| `pthread_getcpuclockid(thread, clock_id)` | Return CPU-time clock for thread |
Patches: All in local/patches/relibc/P5-sched-complete.patch
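A minimal sketch of the two priority-range stubs, following the values in the table above; the error path reuses the `set_errno` convention from the earlier `sched_setscheduler` sketch:

```rust
// relibc/src/header/sched/mod.rs — sketch only.
pub extern "C" fn sched_get_priority_max(policy: c_int) -> c_int {
    match policy {
        SCHED_FIFO | SCHED_RR => 99,
        SCHED_OTHER => 0,
        _ => return set_errno(EINVAL), // mirrors the sched_setscheduler sketch above
    }
}

pub extern "C" fn sched_get_priority_min(policy: c_int) -> c_int {
    match policy {
        SCHED_FIFO | SCHED_RR => 1,
        SCHED_OTHER => 0,
        _ => return set_errno(EINVAL),
    }
}
```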
Phase R2: Robust and PI Mutex Support (Week 5-9, parallel with S4)
Goal: Full POSIX mutex robustness and priority inheritance.
R2.1 — PI mutex protocol
// relibc/src/sync/pthread_mutex.rs
pub struct PthreadMutex {
futex: AtomicU32,
owner: AtomicUsize, // os_tid of current owner
pi_waiters: Mutex<Vec<(OsTid, u32)>>, // waiters with requested priority
flags: AtomicU32, // PTHREAD_PRIO_INHERIT, PTHREAD_MUTEX_ROBUST
}
impl PthreadMutex {
    // Lock with PI: the kernel assigns ownership and boosts the holder's
    // priority when a higher-priority waiter arrives (FUTEX_LOCK_PI).
    fn lock_pi(&self) -> Result<(), Errno> {
        loop {
            match futex::lock_pi(&self.futex) {
                Ok(()) => {
                    self.owner.store(current_tid(), Ordering::Release);
                    return Ok(());
                }
                Err(EAGAIN) => continue, // owner changed mid-operation; retry
                Err(err) => return Err(err),
            }
        }
    }
}
R2.2 — Robust mutex protocol
pub struct RobustList {
head: *mut RobustListHead,
}
pub struct RobustListHead {
list: RobustList,
futex_offset: isize,
pending: *mut RobustListHead,
}
// On thread exit:
fn handle_robust_list(thread: &Pthread) {
    for entry in thread.robust_list.iter() {
        // The futex word sits at a fixed signed offset from the list entry.
        let futex_addr =
            (entry as *const _ as usize).wrapping_add_signed(entry.futex_offset) as *mut AtomicU32;
        unsafe {
            // Set FUTEX_OWNER_DIED so the next locker observes EOWNERDEAD
            (*futex_addr).fetch_or(FUTEX_OWNER_DIED, Ordering::Release);
        }
        // Wake one waiter; it must then call pthread_mutex_consistent
        futex::wake(futex_addr, 1);
    }
}
Phase R3: Thread Groups and Process Identity (Week 10-12)
Goal: Proper tgid/pid distinction, kill(pid, 0) process targeting.
R3.1 — Kernel thread group concept
pub struct Context {
pub tgid: usize, // Thread Group ID (= pid for main thread)
pub tid: usize, // Thread ID (unique per thread)
}
- On `clone(CLONE_THREAD)`: child gets the same tgid as the parent, new tid
- On `fork()`: child gets a new tgid equal to the child's tid
- `getpid()` returns tgid; `gettid()` returns tid
- `kill(tgid, sig)` delivers a signal to all threads in the thread group
R3.2 — Thread group signal delivery
fn deliver_signal_to_thread_group(tgid: usize, sig: Signal) {
for context in contexts_in_thread_group(tgid) {
// Pick a thread that hasn't blocked this signal
if !context.sig_blocked(sig) {
context.deliver_signal(sig);
break;
}
}
}
Patches:
- `local/patches/kernel/P5-tgid.patch` — thread group ID kernel support
- `local/patches/kernel/P5-tgid-signal.patch` — process-targeted signal delivery
- `local/patches/relibc/P5-gettid.patch` — gettid() syscall
5. Dependency Chain
Phase S1 (observability)
│
├──► Phase S2 (real-time scheduling) ────┐
│ │ │
│ ├──► Phase R1 (POSIX sched API) │
│ │ │
│ └──► KWin input thread priority │
│ │
├──► Phase S3 (load balancing) ───────────┤
│ │ │
│ └──► Mesa worker thread scaling │
│ │
├──► Phase S4 (futex enhancements) ───────┤
│ │ │
│ ├──► Phase R2 (PI/robust mutex) │
│ │ │
│ └──► KWin compositor lock │
│ │
├──► Phase S5 (dynamic prio + affinity) ──┤
│ │ │
│ └──► Application CPU pinning │
│ │
├──► Phase R3 (thread groups) ────────────┤
│ │ │
│ └──► process-targeted signals │
│ │
└──► Phase S6 (NUMA) ─────────────────────┘
│
└──► Multi-socket server performance
Independent work (can run in parallel):
- S2 (RT scheduling) + R1 (POSIX sched API) — parallel
- S4 (futex) + R2 (PI/robust mutex) — parallel
- S3 (load balancing) can start after S1 but independently of S2
- S6 (NUMA) depends on S3 (per-CPU queues) but not on S4/S5
6. Integration with Existing Plans
| Existing Plan | Relationship |
|---|---|
| KERNEL-IPC-CREDENTIAL-PLAN.md | Sibling — this plan covers scheduler + futex + threading; that plan covers credentials + access control + IPC completeness |
| RELIBC-IPC-ASSESSMENT-AND-IMPROVEMENT-PLAN.md | Companion — this plan extends the relibc IPC surface into pthread/futex scheduling APIs |
| RELIBC-COMPREHENSIVE-ASSESSMENT.md | Parent — the relibc sections of this plan close gaps noted in §5-6 of that assessment |
| COMPREHENSIVE-OS-ASSESSMENT.md | Parent — this plan closes §2 kernel gaps for scheduler/scalability |
| CONSOLE-TO-KDE-DESKTOP-PLAN.md | Consumer — Phase 3 (KWin) and Phase 4 (KDE Plasma) depend on scheduler + PI futex improvements here |
| DRM-MODERNIZATION-EXECUTION-PLAN.md | Sibling — GPU worker thread scheduling benefits from load balancing (S3) |
| IRQ-AND-LOWLEVEL-CONTROLLERS-ENHANCEMENT-PLAN.md | Sibling — IRQ latency affects scheduling latency |
7. Patch Governance
All kernel and relibc source changes follow the durability policy (local/AGENTS.md):
local/patches/
├── kernel/
│ (Delivered: P6-* and P7-* patches below. P5-sched-* entries are planned future carriers.)
│ ├── P5-sched-observability.patch # S1
│ ├── P5-sched-policy.patch # S2
│ ├── P5-sched-policy-proc.patch # S2 proc: scheme
│ ├── P6-percpu-runqueues.patch # S3 (delivered: infrastructure)
│ ├── P6-futex-sharding.patch # S4 (delivered: sharding)
│ ├── P6-vruntime-context.patch # S5 (delivered: field + init)
│ ├── P6-vruntime-switch.patch # S5 (delivered: update + selection)
│ ├── (remaining S3-S6 patches deferred)
├── relibc/
│ ├── P5-sched-observe.patch # R1 baseline
│ ├── P5-sched-setscheduler.patch # R1
│ ├── P5-sched-getscheduler.patch # R1
│ ├── P5-sched-priority.patch # R1
│ ├── P5-sched-complete.patch # R1 remaining stubs
│ ├── (PI/robust mutex deferred) # R2
│ ├── P7-setpriority.patch # S5 (delivered)
│ ├── P7-pthread-affinity.patch # S5 (delivered)
│ └── P5-gettid.patch # R3
8. Validation and Evidence
8.1 Build Evidence
| Check | Command |
|---|---|
| Kernel compiles | make r.kernel |
| relibc compiles | make r.relibc |
| Full OS builds | make all CONFIG_NAME=redbear-full |
8.2 Runtime Evidence
| Test | Verification |
|---|---|
| `sched_getscheduler()` returns policy | `redbear-info --sched` |
| `pthread_setschedparam()` changes priority | Threaded test binary: `test-sched-priority` |
| RT thread preempts SCHED_OTHER | Latency test: RT thread wakes within 100μs |
| Work stealing across CPUs | redbear-info --sched shows balanced queue depths |
| PI futex prevents priority inversion | PI test: low-prio holder, high-prio waiter, medium-prio contester |
| Robust mutex recovery after thread kill | Robust test: kill thread holding mutex, verify EOWNERDEAD |
| Thread affinity pinning | taskset-like test: verify thread stays on assigned CPU |
| Load balancing on fork bomb | Spawn 2× CPUs threads, verify even distribution |
8.3 Verification Scripts
local/scripts/test-sched-qemu.sh # Scheduler metric validation
local/scripts/test-sched-rt-qemu.sh # Real-time scheduling proof
local/scripts/test-futex-pi-qemu.sh # PI futex proof
local/scripts/test-futex-robust-qemu.sh # Robust futex proof
local/scripts/test-sched-balance-qemu.sh # Load balancing proof (multi-vCPU)
9. Bottom Line
The Redox kernel scheduler is functional but simple — a correct DWRR implementation that works for a lightly-loaded system. For the KDE/Wayland desktop with dozens of competing threads (compositor, rendering, I/O, timers, D-Bus, input), it needs:
- Real-time scheduling (S2) — for audio and compositor input threads
- PI futexes (S4/R2) — to prevent the compositor lock from being inverted by background work
- Load balancing (S3) — to use all available cores efficiently
- Dynamic priority (S5) — to keep the compositor responsive under CPU load
These four items are the critical path to a responsive desktop. The remaining items (NUMA, thread groups, robust mutexes, affinity API) are important for correctness and server-class workloads but not desktop-blocking.
Total estimated effort: 13 weeks with 1-2 kernel developers, delivering incremental improvements at each phase boundary.