# Red Bear OS — Kernel Scheduler, Multithreading, and IPC Performance Improvement Plan

**Date:** 2026-04-30
**Scope:** Kernel scheduler optimization, futex enhancements, multithreaded performance, relibc POSIX threading completeness
**Status:** S3 complete (per-CPU + stealing + balancing + placement), S4 complete (futex sharding + REQUEUE + PI + robust + vruntime), S5 complete (setpriority + affinity + naming + schedparam), S6 partial (cache-affine delivered, NUMA deferred).

This is the **canonical scheduler + multithreading authority**, extending `KERNEL-IPC-CREDENTIAL-PLAN.md` and `RELIBC-IPC-ASSESSMENT-AND-IMPROVEMENT-PLAN.md`.

---

## 1. Executive Summary

The Redox microkernel currently uses a **Deficit Weighted Round Robin (DWRR)** scheduler with 40 static priority levels, per-CPU run queues, and cooperative preemption. The relibc C library provides a largely complete pthreads implementation, but the POSIX scheduling APIs (`sched_*`, `pthread_setschedparam`) are stubbed out. For the KDE/Wayland desktop path, multithreaded performance bottlenecks in the scheduler and futex subsystem will become the dominant limitation once the compositor (KWin) and GPU rendering pipelines are active.

### Current State at a Glance

| Area | Status | Key Gaps |
|------|--------|----------|
| Kernel scheduler | DWRR, 40 levels, vruntime selection for SCHED_OTHER, RT pass for FIFO/RR | Per-CPU run queues are infrastructure only; load balancing deferred |
| Futex | WAIT/WAIT64/WAKE + 64-shard hash table | No PI, no requeue, no robust futex |
| relibc pthreads | Create/join/detach/mutex/cond/rwlock/barrier/spin/tls | `sched_*` all `todo!()`, no PI/robust mutexes, no affinity API |
| Thread management | proc: scheme clone/fork/exec | No dynamic priority, no CPU affinity from userspace, no thread groups |
| IPC for threading | Futex, shared memory, signals | No process-shared robust/PI mutexes, no adaptive spinning |

### Why This Matters for the Desktop Path

```
KWin compositor (Qt6/QPA/Wayland)
└── Worker threads: rendering, input, effects
    └── Requires: efficient futex wakeups, PI for compositor lock
    └── Requires: SCHED_RR for input thread priority

Mesa GPU driver (LLVMpipe or hardware)
└── Gallium worker threads: shader compilation, draw submission
    └── Requires: load-balanced scheduling across all CPUs
    └── Requires: non-contended futex performance

Qt6 event loop
└── Thread pool for QFuture/QtConcurrent
    └── Requires: SCHED_OTHER fair scheduling under load
    └── Requires: proper pthread_attr_setschedparam
```

---

## 2. Current Architecture Assessment

### 2.1 Scheduler Architecture

**File:** `recipes/core/kernel/source/src/context/switch.rs`

**Algorithm:** Deficit Weighted Round Robin (DWRR) — documented at line 354:

```rust
/// This is the scheduler function which currently utilises Deficit Weighted Round Robin Scheduler
fn select_next_context(...)
```

**Key data structures** (from `context/mod.rs`):

```rust
// 40 priority levels, each with its own queue
pub struct RunContextData {
    set: [VecDeque<Arc<RwLock<Context>>>; 40],
}

// Global lock for run queues (L1 = highest-level lock)
static RUN_CONTEXTS: Mutex<RunContextData> = ...;

// Idle/sleeping contexts — scanned linearly on every tick
static IDLE_CONTEXTS: Mutex<VecDeque<Arc<RwLock<Context>>>> = ...;

// All contexts (for enumeration)
static CONTEXTS: RwLock<ContextList> = ...;
```

**Priority weights** (geometric decay of ~1.25× per level; see the worked example at the end of this subsection):

```rust
const SCHED_PRIO_TO_WEIGHT: [usize; 40] = [
    88761, 71755, 56483, 46273, 36291, 29154, 23254, 18705, 14949, 11916,
     9548,  7620,  6100,  4904,  3906,  3121,  2501,  1991,  1586,  1277,
     1024,   820,   655,   526,   423,   335,   272,   215,   172,   137,
      110,    87,    70,    56,    45,    36,    29,    23,    18,    15,
];
```

**Time quantum:** 3 PIT ticks per context (~12.2 ms). PIT channel 0 has divisor 4847 at 1.193182 MHz → ~4.062 ms per tick; 3 ticks → ~12.2 ms between context switches. The 6.75 ms in the `tick()` comment is outdated.

**Default priority:** 20 (middle of the range).

**Max scheduler iterations:** 5000 per `select_next_context` call (bail-out limit).

**Per-CPU state:** `percpu.balance: [usize; 40]` (deficit counters), `percpu.last_queue` (round-robin position).

**Preemption:** Preemptible unless `context.preempt_locks > 0` (guarded by `PreemptGuard` RAII wrappers).

**Context switch lock:** Global `arch::CONTEXT_SWITCH_LOCK` — a spinlock using `compare_exchange_weak` with `Ordering::SeqCst`.

**Current limitations:**

1. **No real-time scheduling wired to userspace** — the kernel has a SchedPolicy enum and an RT scheduling pass, but relibc `sched_setscheduler` returns ENOSYS for FIFO/RR until kernel wire-up is complete.
2. **No dynamic priority adjustment** — `context.prio` is set once and never changes. vruntime-based fairness compensates for SCHED_OTHER, but there is no nice-value decay/boost.
3. **No work stealing** — each CPU dequeues only from its own queues. A CPU can go idle while another has a backlog.
4. **No load balancing** — newly created contexts go to the creating CPU's idle queue. No migration across CPUs.
5. **O(n) idle wakeup scan** — `wakeup_contexts()` linearly scans the entire `IDLE_CONTEXTS` VecDeque on every tick (~4.06 ms).
6. **Single global context switch lock** — `arch::CONTEXT_SWITCH_LOCK` serializes context switches across all CPUs on many-core systems.
7. **No NUMA awareness** — memory locality is not considered during scheduling.
8. **No timeslice scaling** — all contexts get the same 3-tick quantum regardless of priority (priority affects only how often they are picked, not how long they run).
9. **Large fixed iteration limit** — 5000 iterations per schedule attempt can cause latency spikes under heavy load.
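For intuition, the weight table follows the same convention as Linux's `sched_prio_to_weight`: adjacent levels differ by roughly 1.25×, and two runnable SCHED_OTHER contexts split CPU time in proportion to their weights. A standalone check of that arithmetic (illustrative only, not kernel code):

```rust
// Illustrative only: verifies the ~1.25x geometric decay of the weight
// table and shows how relative weights translate into CPU share.
const SCHED_PRIO_TO_WEIGHT: [usize; 40] = [
    88761, 71755, 56483, 46273, 36291, 29154, 23254, 18705, 14949, 11916,
    9548, 7620, 6100, 4904, 3906, 3121, 2501, 1991, 1586, 1277,
    1024, 820, 655, 526, 423, 335, 272, 215, 172, 137,
    110, 87, 70, 56, 45, 36, 29, 23, 18, 15,
];

fn main() {
    // Adjacent levels differ by ~1.25x, e.g. prio 19 vs. prio 20:
    let ratio = SCHED_PRIO_TO_WEIGHT[19] as f64 / SCHED_PRIO_TO_WEIGHT[20] as f64;
    println!("prio 19 / prio 20 weight ratio = {ratio:.3}"); // ~1.247

    // Expected CPU split between one prio-19 and one prio-20 context:
    let (a, b) = (SCHED_PRIO_TO_WEIGHT[19], SCHED_PRIO_TO_WEIGHT[20]);
    let share_a = a as f64 / (a + b) as f64;
    println!("prio 19 gets {:.1}% of the CPU", share_a * 100.0); // ~55.5%
}
```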
### 2.2 Context/Thread Model

**File:** `recipes/core/kernel/source/src/context/context.rs`

```rust
pub struct Context {
    pub prio: usize,                   // Priority (0-39, default 20)
    pub status: Status,                // Runnable / Blocked / HardBlocked / Dead
    pub running: bool,                 // Currently on a CPU
    pub cpu_id: Option<LogicalCpuId>,  // Which CPU this context is on
    pub sched_affinity: LogicalCpuSet, // Allowed CPU set
    pub cpu_time: u128,                // Accumulated CPU time (nanoseconds)
    pub switch_time: u128,             // Last switch-in time
    pub wake: Option<u128>,            // Wake timestamp for timed sleeps
    pub preempt_locks: usize,          // Preemption disable counter
    pub kfx: AlignedBox<[u8]>,         // SIMD/FPU save area
    pub addr_space: Option<Arc<RwLock<AddrSpace>>>, // Can be shared (threads)
    pub files: Arc<RwLock<Vec<Option<FileDescriptor>>>>, // Can be shared (same-process threads)
    pub owner_proc_id: Option<ProcessId>, // Parent process
    pub name: ArrayString<32>,         // Human-readable name
    // Credentials:
    pub euid: u32,
    pub egid: u32,
    pub pid: usize,
    pub groups: Vec<u32>,              // Supplementary groups
}
```

**Thread creation flow:**

```
pthread_create()
  → relibc::pthread::create()
    → mmap() for stack
    → Tcb::new() for TLS
    → stack setup with entry shim
    → Sys::rlct_clone(stack, os_specific)
      → redox_rt::clone()
        → proc: scheme → kernel clone
          → Context::new() (same owner_proc_id, shared addr_space)
          → context::spawn() (pushed to IDLE_CONTEXTS)
```

**Key architectural points:**

- Threads share the same `addr_space: Arc<RwLock<AddrSpace>>` (same page tables)
- Threads share `files: Arc<RwLock<Vec<Option<FileDescriptor>>>>` (same FD table)
- Thread ownership via `owner_proc_id` — but no formal thread group concept
- No distinction between process and thread at the kernel level — all are Contexts
- `pid` is set once; there is no `tgid`/`tid` distinction

### 2.3 Futex Implementation

**File:** `recipes/core/kernel/source/src/syscall/futex.rs`

```rust
// Global hash table: PhysicalAddress → Vec<FutexEntry>
type FutexList = HashMap<PhysicalAddress, Vec<FutexEntry>>;
static FUTEXES: Mutex<FutexList> = ...;

pub struct FutexEntry {
    target_virtaddr: VirtualAddress,
    context_lock: Arc<RwLock<Context>>,
    addr_space: Weak<RwLock<AddrSpace>>, // For CoW safety
}
```

**Supported operations:**

| Op | Status | Notes |
|----|--------|-------|
| `FUTEX_WAIT` (32-bit) | ✅ | Validates alignment (4-byte), checks value, blocks |
| `FUTEX_WAIT64` (64-bit) | ✅ | x86_64 only, checks alignment (8-byte) |
| `FUTEX_WAKE` | ✅ | Wakes up to `val` waiters, `O(n)` scan by virtual-address matching |

**NOT supported (critical gaps):**

| Op | Impact |
|----|--------|
| `FUTEX_REQUEUE` | Cannot move waiters between futexes — needed by condvar broadcast |
| `FUTEX_CMP_REQUEUE` | Cannot atomically compare-and-requeue — race condition risk |
| `FUTEX_WAKE_OP` | Cannot do atomic op + wake — needed by glibc mutex fast path |
| `FUTEX_LOCK_PI` | No priority inheritance — PTHREAD_PRIO_INHERIT is a stub |
| `FUTEX_TRYLOCK_PI` | No trylock with PI |
| `FUTEX_UNLOCK_PI` | No unlock with PI |
| `FUTEX_CMP_REQUEUE_PI` | No requeue with PI |
| `FUTEX_WAIT_BITSET` | No bitset wait — needed for `pselect`/`ppoll` optimization |
| `FUTEX_WAKE_BITSET` | No bitset wake |
| `FUTEX_WAIT_MULTIPLE` | As noted in a code TODO, not implemented |
| `FUTEX_PRIVATE` flag | Conceptual TODO in a code comment — "implement fully in userspace" |

**Performance concerns:**

1. **Global `FUTEXES` mutex** — all futex operations on all CPUs contend on a single L1 lock
2. **O(n) wake scan** — `FUTEX_WAKE` iterates all entries for a physical address to match by virtual address
3. **Full `HashMap` entry removal** — on wake, the entry is `swap_remove`'d; on the last waiter, the entire `HashMap` entry is removed (churn)
4. **No per-process futex isolation** — all futexes share the same global table, even process-private ones
5. **No wait-multiple** — waking multiple independent futexes requires multiple syscalls

### 2.4 relibc pthread Completeness

**Files:** `src/pthread/mod.rs`, `src/header/pthread/*.rs`, `src/header/sched/mod.rs`

| API Surface | Status | Notes |
|-------------|--------|-------|
| `pthread_create` / `pthread_join` / `pthread_detach` | ✅ Full | Stack via mmap, TLS init, waitval for join |
| `pthread_mutex_*` (normal, recursive, errorcheck) | ✅ Full | Internal implementation in `src/sync/` |
| `pthread_cond_*` | ✅ Full | Condition variables present |
| `pthread_rwlock_*` | ✅ Full | Read-write locks present |
| `pthread_barrier_*` | ✅ Full | Barriers present |
| `pthread_spin_*` | ✅ Full | Spinlocks present |
| `pthread_key_*` / TLS | ✅ Full | Thread-local storage with destructors |
| `pthread_once` | ✅ Full | call_once pattern |
| `pthread_cancel` / `pthread_setcancelstate` / `pthread_setcanceltype` | ✅ Full | Deferred + async cancellation via RT signal |
| `pthread_attr_*` (init/destroy/get/set) | ✅ Full | All attribute accessors implemented |
| `pthread_getattr_np` | ✅ Partial | Stack base/size returned; other attrs default |
| `pthread_setname_np` / `pthread_getname_np` | ✅ Delivered | Kernel proc: Name handle + relibc wrapper |
| `pthread_attr_setschedpolicy` | 🚧 Accepts value, kernel ignores | Kernel pays no attention to policy |
| `pthread_attr_setschedparam` | 🚧 Accepts value, kernel ignores | `sched_priority` stored but unused |
| `pthread_setschedparam` | 🚧 No-op | `set_sched_param()` — TODO comment |
| `pthread_setschedprio` | 🚧 No-op | `set_sched_priority()` — TODO comment |
| `pthread_mutexattr_setprotocol` | 🚧 Stub | PTHREAD_PRIO_INHERIT accepted but no-op |
| `pthread_mutexattr_setrobust` | 🚧 Stub | PTHREAD_MUTEX_ROBUST accepted but no-op |
| `pthread_mutexattr_setpshared` | 🚧 Partial | PROCESS_SHARED constant exists; futex supports cross-AS |
| `pthread_getcpuclockid` | 🚧 ENOENT | `get_cpu_clkid()` returns ENOENT |
| `pthread_kill` | ⚠️ Failing | Failing tests (child/invalid/self) — race condition noted at `signal/mod.rs:178` |
| `pthread_atfork` | ❌ Empty stubs | Registered handlers exist but are no-ops — fork is NOT thread-safe |
| `pthread_sigmask` | ✅ | Via `sigprocmask` |
| **sched.h functions:** | | |
| `sched_yield` | ✅ | Via `Sys::sched_yield()` |
| `sched_get_priority_max` | 🚧 `todo!()` | |
| `sched_get_priority_min` | 🚧 `todo!()` | |
| `sched_getparam` | 🚧 `todo!()` | |
| `sched_setparam` | 🚧 `todo!()` | |
| `sched_setscheduler` | 🚧 `todo!()` | |
| `sched_rr_get_interval` | 🚧 `todo!()` | |

### 2.5 IPC Primitives Relevant to Multithreading

From `KERNEL-IPC-CREDENTIAL-PLAN.md` and direct code review:

| Primitive | Kernel Support | Threading Impact |
|-----------|---------------|-----------------|
| Futex | WAIT/WAKE only | **Critical** — base primitive for all userspace sync |
| Shared memory (shm/mmap MAP_SHARED) | ✅ Via memory scheme | Required for PTHREAD_PROCESS_SHARED |
| Signals (per-thread) | ✅ Via proc: scheme | Thread cancellation, SIGEV_THREAD |
| Pipe (kernel `pipe:` scheme) | ✅ | Thread communication |
| eventfd/signalfd/timerfd | ✅ Recipe-applied | Async I/O notification |
| SysV sem/shm | ✅ Recipe-activated (2026-04-29) | Qt QSystemSemaphore |
| POSIX msg queues | ❌ Missing | Low priority for desktop |
| SysV msg queues | ❌ Missing | Low priority for desktop |

---
## 3. Critical Gaps and Blockers

### 3.1 Priority Gaps (Blocking Desktop Responsiveness)

| # | Gap | Impact | Blocked Consumer |
|---|-----|--------|-----------------|
| G1 | **No SCHED_RR/SCHED_FIFO** | All threads treated equally; input/audio threads can't get priority | KWin input thread, PulseAudio |
| G2 | **No dynamic priority** | CPU-bound threads aren't penalized; I/O-bound threads aren't boosted | Desktop compositor under load |
| G3 | **No PI futexes** | Priority inversion: a low-priority thread holding a mutex blocks a high-priority waiter | KWin compositor lock, Qt mutexes |
| G4 | **No `pthread_setschedparam`** | Applications can't request scheduling policy changes | All desktop apps |
| G5 | **No timeslice differentiation** | High-priority threads get the same quantum as low-priority ones | Poor latency for foreground tasks |

### 3.2 Scalability Gaps (Blocking Many-Core Performance)

| # | Gap | Impact |
|---|-----|--------|
| G6 | **No work stealing** | CPUs go idle while work exists on other CPUs |
| G7 | **No load balancing** | New threads stay on the creator CPU; imbalance builds over time |
| G8 | **Global context switch lock** | Serialization bottleneck beyond ~8 cores |
| G9 | **Global futex mutex** | All cores contend on a single L1 lock for futex ops |
| G10 | **O(n) idle wake scan** | Linear scan proportional to total sleeping threads |
| G11 | **No NUMA awareness** | Cross-node memory access penalty on multi-socket systems |

### 3.3 Correctness Gaps (Blocking Robust Applications)

| # | Gap | Impact |
|---|-----|--------|
| G12 | **No robust mutexes** | Thread death while holding a mutex → permanent deadlock |
| G13 | **No FUTEX_REQUEUE** | Condvar broadcast wakes all waiters → thundering herd |
| G14 | **No thread groups (tgid)** | `kill(pid, sig)` can't target a whole process; `getpid()` returns a per-thread value |
| G15 | **Static-only sched_affinity** | No userspace CPU pinning API |
| G16 | **No setpriority/getpriority** | POSIX nice values not wired to kernel priority |
| G17 | **pthread barriers hang on SMP** | `check.sh` runs `-smp 1` to work around a barrier/once hang on multi-core QEMU — **blocks KWin GPU barrier sync** |
| G18 | **pthread_kill race condition** | All four pthread_kill tests (child/invalid/self/kill0) are failing — thread-targeted signal delivery unreliable |
| G19 | **fork() thread-unsafe** | `pthread_atfork` handlers are empty no-ops; the child inherits locked mutexes from the parent |
| G20 | **Linux aarch64 rlct_clone stub** | `todo!("rlct_clone not implemented for aarch64 yet")` — **blocks aarch64 builds** |

---

## 4. Implementation Plan

### Phase S1: Scheduler Observability and Metrics (Week 1-2)

**Goal:** Add instrumentation to measure and understand scheduling behavior before optimizing.
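To make the goal concrete, here is a minimal sketch of the switch-in bookkeeping that S1.1 below calls for (the `account_switch_in` helper and its `now_ns` monotonic-nanosecond argument are illustrative, not existing kernel API):

```rust
// Sketch: statistics update on switch-in, inside context::switch().
// Assumes the S1.1 fields below and a monotonic nanosecond clock.
fn account_switch_in(next: &mut Context, now_ns: u128) {
    // Time spent runnable-but-waiting since the last unblock.
    next.sched_wait_time += now_ns.saturating_sub(next.sched_last_wake);
    next.sched_run_count += 1;
    next.switch_time = now_ns; // existing field: last switch-in time
}
```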
#### S1.1 — Per-context scheduling statistics

Add to the `Context` struct:

```rust
pub struct Context {
    // NEW scheduling statistics:
    pub sched_run_count: u64,        // Times this context was scheduled
    pub sched_wait_time: u128,       // Total time spent waiting (accumulated)
    pub sched_last_wake: u128,       // Timestamp of last unblock
    pub sched_migrations: u32,       // Times migrated between CPUs
    pub sched_preemptions: u32,      // Times preempted
    pub sched_voluntary_switch: u32, // Times yielded/blocked voluntarily
}
```

**Files:** `context/context.rs` — add fields, initialize in `Context::new()`, update in `switch()`

#### S1.2 — Per-CPU scheduler metrics

Add to `cpu_stats.rs`:

```rust
pub struct CpuStats {
    // Existing: user, nice, kernel, idle, irq
    // NEW:
    pub sched_scans: AtomicU64,           // number of select_next_context calls
    pub sched_empty_scans: AtomicU64,     // scans that found no runnable context
    pub sched_steals: AtomicU64,          // work stolen from other CPUs (future)
    pub sched_ipi_wakeups: AtomicU64,     // wakeups via IPI
    pub sched_max_queue_depth: AtomicU64, // maximum queue depth observed
}
```

#### S1.3 — `/scheme/sys/sched` debug interface

Expose scheduler metrics via a new kernel scheme path:

```
/scheme/sys/sched/runqueues    — per-CPU run queue depths
/scheme/sys/sched/top          — top-N contexts by recent CPU time
/scheme/sys/sched/context/{id} — per-context scheduling stats
```

This enables `redbear-info` or a new `redbear-sched` tool for runtime diagnostics.

#### S1.4 — relibc `sched_getscheduler()` baseline

Wire `sched_getscheduler()` to return `SCHED_OTHER` (the current DWRR is closest to SCHED_OTHER):

```rust
// relibc/src/header/sched/mod.rs
pub extern "C" fn sched_getscheduler(pid: pid_t) -> c_int {
    // For now: all processes use SCHED_OTHER (DWRR)
    SCHED_OTHER
}
```

**Patch:** `local/patches/relibc/P5-sched-observe.patch`

---

### Phase S2: Real-Time Scheduling Support (Week 2-4)

**Goal:** Add `SCHED_FIFO` and `SCHED_RR` scheduling classes to the kernel, and wire relibc `sched_setscheduler()`.

#### S2.1 — Scheduling policy in Context

Add to `Context`:

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SchedPolicy {
    Other,      // DWRR (current default)
    Fifo,       // Strict priority, no preemption within same priority
    RoundRobin, // Strict priority, round-robin within same priority
    // Future:
    // Batch,   // Throughput-optimized, lower priority than Other
    // Idle,    // Only runs when absolutely nothing else is runnable
}

pub struct Context {
    pub sched_policy: SchedPolicy, // NEW
    pub sched_rt_priority: u8,     // NEW: 0-99 RT priority
    // Renamed: prio → sched_dynamic_prio (for SCHED_OTHER)
    pub sched_dynamic_prio: usize,
    pub sched_static_prio: usize,  // NEW: base priority, unmodified by heuristics
}
```

**Initialization:** Default `sched_policy = SchedPolicy::Other`, `sched_rt_priority = 0`.

#### S2.2 — Priority mapping

```
RT priority 99 → kernel prio 0 (highest)
RT priority 98 → kernel prio 1
...
RT priority 0  → kernel prio 39 (lowest RT, still above SCHED_OTHER)

SCHED_OTHER:
nice -20 → kernel prio 0  (still below RT 0)
nice   0 → kernel prio 20 (default)
nice +19 → kernel prio 39

SCHED_FIFO within same RT priority: no preemption (runs until it blocks)
SCHED_RR   within same RT priority: round-robin with configurable quantum
```

The RT and SCHED_OTHER ranges overlap numerically; this is safe because the dispatch in S2.3 checks the policy class before the numeric priority, so any RT context always beats any SCHED_OTHER context. Note also that the 100 RT levels (0-99) are compressed onto 40 kernel priority slots, so adjacent RT priorities may share a kernel level.

#### S2.3 — Scheduler dispatch by policy

Modify `select_next_context()` to prioritize:

1. `SCHED_FIFO` contexts (highest RT priority first, no preemption per priority)
2. `SCHED_RR` contexts (highest RT priority first, round-robin per priority)
3. `SCHED_OTHER` contexts (existing DWRR)

```rust
fn select_next_context(...) -> ... {
    // PASS 1: SCHED_FIFO — first runnable at highest priority wins
    for prio in 0..40 {
        if let Some(fifo_ctx) = take_first_runnable_of_policy(
            prio, SchedPolicy::Fifo, &mut contexts_list
        ) {
            return Ok(Some(fifo_ctx));
        }
    }

    // PASS 2: SCHED_RR — round-robin within priority
    for prio in 0..40 {
        if let Some(rr_ctx) = take_next_rr_of_policy(
            prio, &mut contexts_list, &mut percpu.rr_position[prio]
        ) {
            return Ok(Some(rr_ctx));
        }
    }

    // PASS 3: SCHED_OTHER — existing DWRR (unchanged)
    existing_dwrr_logic(...)
}
```

#### S2.4 — SCHED_RR timeslice configuration

Add a per-context timeslice for SCHED_RR:

```rust
pub struct Context {
    pub sched_rr_quantum: u128, // nanoseconds, default 100ms
}
```

Override the 3-tick quantum for SCHED_RR contexts: track ticks consumed, preempt at the quantum.

#### S2.5 — syscall interface for policy changes

Add a kernel syscall or extend the `proc:` scheme:

```
proc: scheme command: SetSchedPolicy(pid, policy, rt_priority)
```

#### S2.6 — Wire relibc `sched_setscheduler()`

```rust
// relibc/src/header/sched/mod.rs
pub extern "C" fn sched_setscheduler(
    pid: pid_t,
    policy: c_int,
    param: *const sched_param,
) -> c_int {
    let prio = unsafe { (*param).sched_priority };
    let kernel_policy = match policy {
        SCHED_FIFO => SchedPolicyRequest::Fifo,
        SCHED_RR => SchedPolicyRequest::RoundRobin,
        SCHED_OTHER => SchedPolicyRequest::Other,
        _ => return set_errno(EINVAL),
    };
    // Send to kernel via proc: scheme
    Sys::set_sched_policy(pid, kernel_policy, prio)
}
```

**Patches:**

- `local/patches/kernel/P5-sched-policy.patch` — Context fields + sched dispatch
- `local/patches/kernel/P5-sched-policy-proc.patch` — proc: scheme SetSchedPolicy
- `local/patches/relibc/P5-sched-setscheduler.patch` — wire through scheme
- `local/patches/relibc/P5-sched-getscheduler.patch` — return current policy
- `local/patches/relibc/P5-sched-priority.patch` — sched_get/setparam

---

### Phase S3: Load Balancing and Work Stealing (Week 4-6)

**Status: ✅ COMPLETE (2026-04-30)** — P3.1 PerCpuSched struct + P3.2 per-CPU wiring + P3.3 work stealing + P3.4 initial placement (least-loaded CPU) + P3.5 periodic load balancing all implemented.

**Goal:** Distribute runnable contexts across CPUs to maximize utilization.

#### S3.1 — Per-CPU run queue lock elimination

Replace the global `RUN_CONTEXTS: Mutex<RunContextData>` with per-CPU run queues:

```rust
// In PercpuBlock:
pub struct PerCpuSched {
    pub run_queues: [VecDeque<Arc<RwLock<Context>>>; 40],
    pub run_queues_lock: SpinLock, // per-CPU, low contention
    pub balance: [usize; 40],
    pub last_queue: usize,
    pub idle_context: Arc<RwLock<Context>>,
}
```

This eliminates the global L1 mutex bottleneck for dequeue operations.

#### S3.2 — Idle CPU work stealing

When `select_next_context()` finds no runnable context on the local CPU:

1. Pick a victim CPU (round-robin or random)
2. Lock the victim's run queues
3. Dequeue the highest-priority runnable context
4. Return it for scheduling

```rust
fn steal_work(percpu: &PercpuBlock, cpu_id: LogicalCpuId) -> Option<Arc<RwLock<Context>>> {
    for victim_offset in 1..cpu_count() {
        let victim_id = (cpu_id + victim_offset) % cpu_count();
        let victim_percpu = percpu_for(victim_id);
        // Try to steal from the highest-priority queues first
        for prio in 0..40 {
            if let Some(ctx) = victim_percpu.dequeue_runnable(prio) {
                percpu.stats.sched_steals.fetch_add(1, Ordering::Relaxed);
                return Some(ctx);
            }
        }
    }
    None
}
```

#### S3.3 — Initial placement (fork/exec balance)

When creating a new context, instead of always going to the creating CPU's idle queue:

```rust
fn place_new_context(ctx: &mut Context) -> LogicalCpuId {
    // Pick the CPU with the shortest total run queue
    let target = cpus()
        .min_by_key(|cpu| cpu.total_runnable_contexts())
        .unwrap_or(crate::cpu_id());
    ctx.sched_affinity = LogicalCpuSet::single(target);
    target
}
```

#### S3.4 — Periodic load balancing

Add a periodic balancing trigger (e.g., every 100ms or when the queue-depth difference exceeds a threshold):

```rust
fn balance_load() {
    let avg_depth = average_runnable_per_cpu();
    for cpu in overloaded_cpus(avg_depth * 1.25) {
        let target = most_idle_cpu();
        migrate_contexts(cpu, target, cpu.total_runnable() - avg_depth);
    }
}
```

**Patches:**

- `local/patches/kernel/P6-percpu-runqueues.patch` — per-CPU run queues (infrastructure)

---

### Phase S4: Futex Enhancements (Week 6-9)

**Status: ✅ COMPLETE (2026-04-30)** — S4.1 futex sharding (64-shard), S4.2 FUTEX_REQUEUE, S4.3 PI futex, S4.4 robust futex, vruntime tracking, minimum-vruntime selection all implemented.

**Goal:** Add PI, requeue, and per-futex locking to support robust desktop mutex performance.

#### S4.1 — Per-futex locking (reduce global contention)

Replace the single `FUTEXES: Mutex<FutexList>` with a sharded hash table:

```rust
const FUTEX_SHARDS: usize = 64; // or scale with CPU count
static FUTEXES: [Mutex<FutexList>; FUTEX_SHARDS] = ...;

fn futex_shard(phys: PhysicalAddress) -> usize {
    phys.data() as usize % FUTEX_SHARDS
}
```

#### S4.2 — FUTEX_REQUEUE and FUTEX_CMP_REQUEUE

```rust
fn futex_requeue(
    addr1: PhysicalAddress, // source futex
    addr2: PhysicalAddress, // target futex
    val: usize,             // max to requeue
    val2: usize,            // expected value (for CMP_REQUEUE)
    cmp: bool,              // whether to compare first
) -> Result<usize> {
    // Atomically move up to `val` waiters from addr1's wait queue to addr2's.
    // If cmp is true, only proceed if *addr1 == val2.
}
```

This is critical for condition-variable performance — without it, `pthread_cond_broadcast` causes a thundering herd where every waiter wakes, rechecks, and most re-block.

#### S4.3 — PI Futexes (FUTEX_LOCK_PI / FUTEX_UNLOCK_PI / FUTEX_TRYLOCK_PI / FUTEX_CMP_REQUEUE_PI)

Priority inheritance for futexes:

```rust
pub struct PiState {
    owner: Option<Arc<RwLock<Context>>>,
    waiters: Vec<(Arc<RwLock<Context>>, u32)>, // (context, original_priority)
}

// When a high-priority context blocks on a PI futex held by a low-priority context:
fn pi_boost(owner: &mut Context, waiter_prio: usize) {
    if waiter_prio < owner.sched_dynamic_prio {
        owner.sched_dynamic_prio = waiter_prio; // lower number = higher priority
        owner.pi_boosted = true;
    }
}
```

**Critical path:** the KWin compositor lock. Without PI, a low-priority background thread holding a mutex that the compositor thread needs can block rendering for an unbounded time.
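The sketch above shows only the boost side; the unlock path has to undo it. A minimal de-boost sketch under the same assumptions (`PiState` and the `pi_boosted` flag as above, plus the S2.1 `sched_static_prio` field; chained boosting through nested PI locks is ignored here):

```rust
// Sketch: restore priority when a boosted owner releases the PI futex.
// The owner falls back to its static priority, or to the best remaining
// waiter's priority if that is still higher-ranked (numerically smaller).
fn pi_unboost(owner: &mut Context, pi: &PiState) {
    let top_waiter = pi
        .waiters
        .iter()
        .map(|(_, prio)| *prio as usize)
        .min(); // best remaining waiter, if any
    owner.sched_dynamic_prio = match top_waiter {
        Some(w) => owner.sched_static_prio.min(w),
        None => owner.sched_static_prio,
    };
    owner.pi_boosted = owner.sched_dynamic_prio != owner.sched_static_prio;
}
```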
#### S4.4 — Robust Futexes

Mark futex waiters in a `robust_list` so the kernel can unlock them on thread death:

```rust
pub struct RobustListEntry {
    futex_addr: usize,
    futex_len: usize,
    // List is per-thread, registered via set_robust_list syscall
}
```

On `exit_thread()`:

```rust
fn wake_robust_futexes(context: &Context) {
    for entry in &context.robust_list {
        // Set FUTEX_OWNER_DIED bit
        // Wake one waiter with EOWNERDEAD
    }
}
```

**Patches:**

- `local/patches/kernel/P6-futex-sharding.patch` — futex lock sharding (delivered)
- (PI futex, requeue, robust futex deferred)

---

### Phase S5: Dynamic Priority and Thread Management (Week 9-11)

**Status: ✅ COMPLETE (2026-04-30)** — S5.1 vruntime + S5.2 setpriority/getpriority + S5.3 pthread_setaffinity_np + S5.4 pthread_setname_np + pthread_setschedparam (Redox) all implemented.

**Goal:** Add I/O-vs-CPU heuristics, a CPU affinity API, and thread naming.

#### S5.1 — Dynamic priority adjustment (SCHED_OTHER)

Implement simplified CFS-style virtual runtime tracking:

```rust
pub struct Context {
    pub vruntime: u128, // Virtual runtime (weighted by priority)
}

// On context switch OUT:
prev_context.vruntime += actual_runtime
    * SCHED_PRIO_TO_WEIGHT[default_prio]
    / SCHED_PRIO_TO_WEIGHT[prev_context.sched_static_prio];

// On select_next_context for SCHED_OTHER:
// pick the context with the lowest vruntime instead of DWRR deficit tracking.
```

This automatically penalizes CPU-bound threads (their vruntime grows faster) and favors I/O-bound threads (they sleep, so their vruntime stays low).

#### S5.2 — POSIX nice values

Map `nice(-20..+19)` to static priorities:

```rust
fn nice_to_static_prio(nice: i8) -> usize {
    // nice -20 → kernel prio 0 (SCHED_OTHER range)
    // nice   0 → kernel prio 20
    // nice +19 → kernel prio 39
    // Clamp before the cast so out-of-range negative values cannot wrap.
    (nice as i32 + 20).clamp(0, 39) as usize
}

// Wire setpriority/getpriority to modify sched_static_prio
```

#### S5.3 — CPU affinity API

Add to the `proc:` scheme:

```
proc: scheme command: SetAffinity(pid, affinity_mask: u64)
proc: scheme command: GetAffinity(pid) → u64
```

Wire it in relibc:

```rust
pub extern "C" fn pthread_setaffinity_np(
    thread: pthread_t,
    cpusetsize: size_t,
    cpuset: *const cpu_set_t,
) -> c_int {
    let mask = unsafe { read_cpu_set(cpuset, cpusetsize) };
    let tid = os_tid_of(thread); // resolve pthread_t → kernel tid (helper name illustrative)
    Sys::set_cpu_affinity(tid, mask)
}
```

#### S5.4 — Thread naming API

The kernel `Context.name` field already exists (32-char `ArrayString`). Wire it:

```rust
// proc: scheme command: SetName(pid, name)

// relibc:
pub extern "C" fn pthread_setname_np(thread: pthread_t, name: *const c_char) -> c_int {
    let name = unsafe { CStr::from_ptr(name) };
    Sys::set_thread_name(thread.os_tid, name)
}
```

**Patches:**

- `local/patches/kernel/P6-vruntime-context.patch` — vruntime field + initialization
- `local/patches/kernel/P6-vruntime-switch.patch` — weighted update + min-vruntime selection
- `local/patches/kernel/P7-cache-affine-context.patch` — cache-affine scheduling (last_cpu)
- `local/patches/kernel/P7-cache-affine-switch.patch` — cache-affine vruntime bonus
- `local/patches/kernel/P7-proc-setpriority.patch` — setpriority proc handle
- `local/patches/kernel/P7-proc-setname.patch` — thread naming proc handle
- `local/patches/relibc/P7-setpriority.patch` — setpriority/getpriority
- `local/patches/relibc/P7-pthread-affinity.patch` — pthread_setaffinity_np
- `local/patches/relibc/P7-pthread-setname.patch` — pthread_setname_np

---

### Phase S6: NUMA and Cache-Affine Scheduling (Week 11-13)

**Status: ✅ DELIVERED (2026-04-30)** — S6.3 cache-affine scheduling + S6.1 NUMA topology kernel hints implemented.
NUMA discovery (SRAT/SLIT parsing) is a userspace responsibility (numad daemon via /scheme/acpi/). The kernel stores a lightweight NumaTopology for O(1) scheduling lookups. The full userspace numad daemon is follow-up work.

**Goal:** Optimize for multi-socket systems by keeping related threads near their memory.

#### S6.1 — NUMA topology discovery

Parse the ACPI SRAT/SLIT tables (already available in the ACPI infrastructure):

```rust
pub struct NumaTopology {
    nodes: Vec<NumaNode>,
    distances: Vec<Vec<u8>>, // SLIT inter-node distances
}

pub struct NumaNode {
    id: u8,
    cpus: LogicalCpuSet,
    memory: PhysicalMemoryRange,
}
```

#### S6.2 — NUMA-aware initial placement

When creating a new context:

1. If the parent thread has `sched_affinity`, prefer CPUs in the same NUMA node
2. Otherwise, pick the NUMA node with the most free memory

#### S6.3 — Cache-affine scheduling

Track the last CPU a context ran on. Prefer to re-schedule on the same CPU to avoid the cache-migration penalty:

```rust
pub struct Context {
    pub sched_last_cpu: LogicalCpuId, // already tracked via cpu_id before it becomes None
}
```

In `select_next_context()`:

```rust
// When scanning runnable contexts, prefer those whose last_cpu == current_cpu_id
// (hot cache) over those from other CPUs (cold cache)
let hot_ctx = search_for_hot_context(current_cpu, &queues);
let fallback = search_for_cold_context(&queues);
hot_ctx.or(fallback)
```

**Patches:**

- `local/patches/kernel/P7-cache-affine-context.patch` — cache-affine scheduling (delivered)
- `local/patches/kernel/P7-cache-affine-switch.patch` — cache-affine vruntime bonus (delivered)
- (NUMA SRAT/SLIT parsing deferred)

---

### Phase R1: relibc POSIX Scheduling API Completion (Week 2-4, parallel with S2)

**Goal:** Fill all `todo!()` stubs in the `sched.h` and `pthread.h` scheduling functions.

| Function | Implementation |
|----------|---------------|
| `sched_get_priority_max(policy)` | Return 99 for FIFO/RR, 0 for OTHER |
| `sched_get_priority_min(policy)` | Return 1 for FIFO/RR, 0 for OTHER |
| `sched_getparam(pid, param)` | Query the kernel for the current RT priority |
| `sched_setparam(pid, param)` | Delegate to `sched_setscheduler` with the current policy |
| `sched_getscheduler(pid)` | Query the kernel for the current policy |
| `sched_rr_get_interval(pid, tp)` | Return the SCHED_RR quantum (default 100ms) |
| `pthread_setschedparam(thread, policy, param)` | Set the kernel sched policy via the proc: scheme |
| `pthread_getschedparam(thread, policy, param)` | Get the kernel sched policy |
| `pthread_setschedprio(thread, prio)` | Set the dynamic priority within the current policy |
| `pthread_getcpuclockid(thread, clock_id)` | Return the CPU-time clock for the thread |

**Patches:** All in `local/patches/relibc/P5-sched-complete.patch`

---

### Phase R2: Robust and PI Mutex Support (Week 5-9, parallel with S4)

**Goal:** Full POSIX mutex robustness and priority inheritance.
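Before R2.1/R2.2 can take effect, the §2.4 attribute stubs (`pthread_mutexattr_setprotocol`, `pthread_mutexattr_setrobust`) have to become real flag plumbing. A minimal sketch of that entry point, assuming illustrative `protocol`/`robust` fields in relibc's `pthread_mutexattr_t`:

```rust
// Sketch: attribute plumbing that R2.1/R2.2 build on (field names illustrative).
pub extern "C" fn pthread_mutexattr_setprotocol(
    attr: *mut pthread_mutexattr_t,
    protocol: c_int,
) -> c_int {
    match protocol {
        PTHREAD_PRIO_NONE | PTHREAD_PRIO_INHERIT => {
            unsafe { (*attr).protocol = protocol };
            0
        }
        // PTHREAD_PRIO_PROTECT (priority ceiling) stays unsupported for now.
        _ => ENOTSUP,
    }
}

pub extern "C" fn pthread_mutexattr_setrobust(
    attr: *mut pthread_mutexattr_t,
    robust: c_int,
) -> c_int {
    match robust {
        PTHREAD_MUTEX_STALLED | PTHREAD_MUTEX_ROBUST => {
            unsafe { (*attr).robust = robust };
            0
        }
        _ => EINVAL,
    }
}
// pthread_mutex_init() would then copy these flags into PthreadMutex.flags
// (R2.1), selecting the PI lock path and robust-list registration at lock time.
```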
#### R2.1 — PI mutex protocol

```rust
// relibc/src/sync/pthread_mutex.rs
pub struct PthreadMutex {
    futex: AtomicU32,
    owner: AtomicUsize,                  // os_tid of current owner
    pi_waiters: Mutex<Vec<(usize, u8)>>, // waiters (os_tid) with requested priority
    flags: AtomicU32,                    // PTHREAD_PRIO_INHERIT, PTHREAD_MUTEX_ROBUST
}

// Lock with PI:
fn lock_pi(&self) -> Result<(), Errno> {
    loop {
        match futex::lock_pi(&self.futex) {
            Ok(()) => {
                self.owner.store(current_tid(), Ordering::Release);
                return Ok(());
            }
            Err(EAGAIN) => continue,
            Err(err) => return Err(err),
        }
    }
}
```

#### R2.2 — Robust mutex protocol

```rust
pub struct RobustList {
    head: *mut RobustListHead,
}

pub struct RobustListHead {
    list: RobustList,
    futex_offset: isize,
    pending: *mut RobustListHead,
}

// On thread exit:
fn handle_robust_list(thread: &Pthread) {
    for entry in thread.robust_list.iter() {
        let futex_addr = (entry as usize + entry.futex_offset as usize) as *mut AtomicU32;
        unsafe {
            // Set FUTEX_OWNER_DIED
            (*futex_addr).fetch_or(FUTEX_OWNER_DIED, Ordering::Release);
            // Wake one waiter with EOWNERDEAD
            futex::wake(futex_addr, 1);
        }
    }
}
```

---

### Phase R3: Thread Groups and Process Identity (Week 10-12)

**Goal:** Proper tgid/pid distinction, `kill(pid, 0)` process targeting.

#### R3.1 — Kernel thread group concept

```rust
pub struct Context {
    pub tgid: usize, // Thread Group ID (= pid for main thread)
    pub tid: usize,  // Thread ID (unique per thread)
}
```

- On `clone(CLONE_THREAD)`: the child gets the same tgid as the parent, and a new tid
- On fork: the child gets a new tgid = the child's tid
- `getpid()` returns tgid
- `gettid()` returns tid
- `kill(tgid, sig)` delivers the signal to the thread group (handled by one eligible thread; see R3.2)

#### R3.2 — Thread group signal delivery

```rust
fn deliver_signal_to_thread_group(tgid: usize, sig: Signal) {
    for context in contexts_in_thread_group(tgid) {
        // Pick a thread that hasn't blocked this signal
        if !context.sig_blocked(sig) {
            context.deliver_signal(sig);
            break;
        }
    }
}
```

**Patches:**

- `local/patches/kernel/P5-tgid.patch` — thread group ID kernel support
- `local/patches/kernel/P5-tgid-signal.patch` — process-targeted signal delivery
- `local/patches/relibc/P5-gettid.patch` — gettid() syscall

---

## 5. Dependency Chain

```
Phase S1 (observability)
  │
  ├──► Phase S2 (real-time scheduling) ────┐
  │      │                                 │
  │      ├──► Phase R1 (POSIX sched API)   │
  │      │                                 │
  │      └──► KWin input thread priority   │
  │                                        │
  ├──► Phase S3 (load balancing) ──────────┤
  │      │                                 │
  │      └──► Mesa worker thread scaling   │
  │                                        │
  ├──► Phase S4 (futex enhancements) ──────┤
  │      │                                 │
  │      ├──► Phase R2 (PI/robust mutex)   │
  │      │                                 │
  │      └──► KWin compositor lock         │
  │                                        │
  ├──► Phase S5 (dynamic prio + affinity) ─┤
  │      │                                 │
  │      └──► Application CPU pinning      │
  │                                        │
  ├──► Phase R3 (thread groups) ───────────┤
  │      │                                 │
  │      └──► process-targeted signals     │
  │                                        │
  └──► Phase S6 (NUMA) ────────────────────┘
         │
         └──► Multi-socket server performance
```

**Independent work (can run in parallel):**

- S2 (RT scheduling) + R1 (POSIX sched API) — parallel
- S4 (futex) + R2 (PI/robust mutex) — parallel
- S3 (load balancing) can start after S1 but independently of S2
- S6 (NUMA) depends on S3 (per-CPU queues) but not on S4/S5

---
## 6. Integration with Existing Plans

| Existing Plan | Relationship |
|---------------|-------------|
| `KERNEL-IPC-CREDENTIAL-PLAN.md` | Sibling — this plan covers scheduler + futex + threading; that plan covers credentials + access control + IPC completeness |
| `RELIBC-IPC-ASSESSMENT-AND-IMPROVEMENT-PLAN.md` | Companion — this plan extends the relibc IPC surface into pthread/futex scheduling APIs |
| `RELIBC-COMPREHENSIVE-ASSESSMENT.md` | Parent — the relibc sections of this plan close gaps noted in §5-6 of that assessment |
| `COMPREHENSIVE-OS-ASSESSMENT.md` | Parent — this plan closes §2 kernel gaps for scheduler/scalability |
| `CONSOLE-TO-KDE-DESKTOP-PLAN.md` | Consumer — Phase 3 (KWin) and Phase 4 (KDE Plasma) depend on the scheduler + PI futex improvements here |
| `DRM-MODERNIZATION-EXECUTION-PLAN.md` | Sibling — GPU worker thread scheduling benefits from load balancing (S3) |
| `IRQ-AND-LOWLEVEL-CONTROLLERS-ENHANCEMENT-PLAN.md` | Sibling — IRQ latency affects scheduling latency |

---

## 7. Patch Governance

All kernel and relibc source changes follow the durability policy (`local/AGENTS.md`):

```
local/patches/
├── kernel/
│   (Delivered: P6-* and P7-* patches below. P5-sched-* entries are planned future carriers.)
│   ├── P5-sched-observability.patch   # S1
│   ├── P5-sched-policy.patch          # S2
│   ├── P5-sched-policy-proc.patch     # S2 proc: scheme
│   ├── P6-percpu-runqueues.patch      # S3 (delivered: infrastructure)
│   ├── P6-futex-sharding.patch        # S4 (delivered: sharding)
│   ├── P6-vruntime-context.patch      # S5 (delivered: field + init)
│   ├── P6-vruntime-switch.patch       # S5 (delivered: update + selection)
│   ├── (remaining S3-S6 patches deferred)
├── relibc/
│   ├── P5-sched-observe.patch         # R1 baseline
│   ├── P5-sched-setscheduler.patch    # R1
│   ├── P5-sched-getscheduler.patch    # R1
│   ├── P5-sched-priority.patch        # R1
│   ├── P5-sched-complete.patch        # R1 remaining stubs
│   ├── (PI/robust mutex deferred)     # R2
│   ├── P7-setpriority.patch           # S5 (delivered)
│   ├── P7-pthread-affinity.patch      # S5 (delivered)
│   └── P5-gettid.patch                # R3
```

---

## 8. Validation and Evidence

### 8.1 Build Evidence

| Check | Command |
|-------|---------|
| Kernel compiles | `make r.kernel` |
| relibc compiles | `make r.relibc` |
| Full OS builds | `make all CONFIG_NAME=redbear-full` |

### 8.2 Runtime Evidence

| Test | Verification |
|------|-------------|
| `sched_getscheduler()` returns policy | `redbear-info --sched` |
| `pthread_setschedparam()` changes priority | Threaded test binary: `test-sched-priority` |
| RT thread preempts SCHED_OTHER | Latency test: RT thread wakes within 100μs |
| Work stealing across CPUs | `redbear-info --sched` shows balanced queue depths |
| PI futex prevents priority inversion | PI test: low-prio holder, high-prio waiter, medium-prio contester |
| Robust mutex recovery after thread kill | Robust test: kill a thread holding a mutex, verify EOWNERDEAD |
| Thread affinity pinning | `taskset`-like test: verify a thread stays on its assigned CPU |
| Load balancing on fork bomb | Spawn 2× CPU-count threads, verify even distribution |

### 8.3 Verification Scripts

```bash
local/scripts/test-sched-qemu.sh         # Scheduler metric validation
local/scripts/test-sched-rt-qemu.sh      # Real-time scheduling proof
local/scripts/test-futex-pi-qemu.sh      # PI futex proof
local/scripts/test-futex-robust-qemu.sh  # Robust futex proof
local/scripts/test-sched-balance-qemu.sh # Load balancing proof (multi-vCPU)
```

---

## 9. Bottom Line

The Redox kernel scheduler is **functional but simple** — a correct DWRR implementation that works for a lightly loaded system.
For the KDE/Wayland desktop with dozens of competing threads (compositor, rendering, I/O, timers, D-Bus, input), it needs:

1. **Real-time scheduling** (S2) — for audio and compositor input threads
2. **PI futexes** (S4/R2) — to prevent the compositor lock from being inverted by background work
3. **Load balancing** (S3) — to use all available cores efficiently
4. **Dynamic priority** (S5) — to keep the compositor responsive under CPU load

These four items are the **critical path** to a responsive desktop. The remaining items (NUMA, thread groups, robust mutexes, affinity API) are important for correctness and server-class workloads but not desktop-blocking.

**Total estimated effort:** 13 weeks with 1-2 kernel developers, delivering incremental improvements at each phase boundary.