Red Bear OS — Kernel Scheduler, Multithreading, and IPC Performance Improvement Plan

Date: 2026-04-30
Scope: Kernel scheduler optimization, futex enhancements, multithreaded performance, relibc POSIX threading completeness
Status: S3 complete (per-CPU + stealing + balancing + placement), S4 complete (futex sharding + REQUEUE + PI + robust + vruntime), S5 complete (setpriority + affinity + naming + schedparam), S6 partial (cache-affine delivered, NUMA deferred). This is the canonical scheduler + multithreading authority, extending KERNEL-IPC-CREDENTIAL-PLAN.md and RELIBC-IPC-ASSESSMENT-AND-IMPROVEMENT-PLAN.md


1. Executive Summary

The Redox microkernel currently uses a Deficit Weighted Round Robin (DWRR) scheduler with 40 static priority levels, per-CPU run queues, and cooperative preemption. The relibc C library provides a largely complete pthreads implementation, but POSIX scheduling APIs (sched_*, pthread_setschedparam) are stubbed out. For the KDE/Wayland desktop path, multithreaded performance bottlenecks in the scheduler and futex subsystem will become the dominant limitation once the compositor (KWin) and GPU rendering pipelines are active.

Current State at a Glance

| Area | Status | Key Gaps |
| --- | --- | --- |
| Kernel scheduler | DWRR, 40 levels, vruntime selection for SCHED_OTHER, RT pass for FIFO/RR | Per-CPU run queues are infrastructure only; load balancing deferred |
| Futex | WAIT/WAIT64/WAKE + 64-shard hash table | No PI, no requeue, no robust futex |
| relibc pthreads | Create/join/detach/mutex/cond/rwlock/barrier/spin/tls | sched_* all todo!(), no PI/robust mutexes, no affinity API |
| Thread management | proc: scheme clone/fork/exec | No dynamic priority, no CPU affinity from userspace, no thread groups |
| IPC for threading | Futex, shared memory, signals | No process-shared robust/PI mutexes, no adaptive spinning |

Why This Matters for the Desktop Path

KWin compositor (Qt6/QPA/Wayland)
  ├── Worker threads: rendering, input, effects
  ├── Requires: efficient futex wakeups, PI for compositor lock
  └── Requires: SCHED_RR for input thread priority

Mesa GPU driver (LLVMpipe or hardware)
  ├── Gallium worker threads: shader compilation, draw submission
  ├── Requires: load-balanced scheduling across all CPUs
  └── Requires: non-contended futex performance

Qt6 event loop
  ├── Thread pool for QFuture/QtConcurrent
  ├── Requires: SCHED_OTHER fair scheduling under load
  └── Requires: proper pthread_attr_setschedparam

2. Current Architecture Assessment

2.1 Scheduler Architecture

File: recipes/core/kernel/source/src/context/switch.rs

Algorithm: Deficit Weighted Round Robin (DWRR) — documented at line 354:

/// This is the scheduler function which currently utilises Deficit Weighted Round Robin Scheduler
fn select_next_context(...)

Key data structures (from context/mod.rs):

// 40 priority levels, each with its own queue
pub struct RunContextData {
    set: [VecDeque<WeakContextRef>; 40],
}

// Global lock for run queues (L1 = highest-level lock)
static RUN_CONTEXTS: Mutex<L1, RunContextData> = ...;

// Idle/sleeping contexts — scanned linearly on every tick
static IDLE_CONTEXTS: Mutex<L2, VecDeque<WeakContextRef>> = ...;

// All contexts (for enumeration)
static CONTEXTS: RwLock<L2, BTreeSet<ContextRef>> = ...;

Priority weights (geometric decay ~1.25x per level):

const SCHED_PRIO_TO_WEIGHT: [usize; 40] = [
    88761, 71755, 56483, 46273, 36291, 29154, 23254, 18705, 14949, 11916,
    9548, 7620, 6100, 4904, 3906, 3121, 2501, 1991, 1586, 1277,
    1024, 820, 655, 526, 423, 335, 272, 215, 172, 137,
    110, 87, 70, 56, 45, 36, 29, 23, 18, 15,
];

Time quantum: 3 PIT ticks per context (~12.2ms). PIT channel 0 has divisor 4847 at 1.193182 MHz → ~4.062ms per tick. 3 ticks → ~12.2ms between context switches. The 6.75ms in the tick() comment is outdated.
Default priority: 20 (middle of range).
Max scheduler iterations: 5000 per select_next_context call (bail-out limit).
Per-CPU state: percpu.balance: [usize; 40] (deficit counters), percpu.last_queue (round-robin position).
Preemption: Preemptible unless context.preempt_locks > 0 (guarded by PreemptGuard RAII wrappers).
Context switch lock: Global arch::CONTEXT_SWITCH_LOCK — spinlock with compare_exchange_weak on Ordering::SeqCst.
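
As a quick sanity check of that arithmetic, a standalone sketch (constants copied from the figures above; illustrative only, not kernel code):

fn main() {
    let tick_ms = 4847.0 / 1_193_182.0 * 1000.0; // divisor / PIT base ≈ 4.062 ms
    let quantum_ms = 3.0 * tick_ms;              // 3 ticks ≈ 12.19 ms
    println!("tick = {tick_ms:.3} ms, quantum = {quantum_ms:.2} ms");
}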

Current limitations:

  1. No real-time scheduling wired to userspace — kernel has SchedPolicy enum and RT scheduling pass, but relibc sched_setscheduler returns ENOSYS for FIFO/RR until kernel wire-up is complete.
  2. No dynamic priority adjustment — context.prio is set once and never changes. vruntime-based fairness compensates for SCHED_OTHER but no nice-value decay/boost.
  3. No work stealing — each CPU only dequeues from its own queues. A CPU can go idle while another has backlog.
  4. No load balancing — newly created contexts go to the creating CPU's idle queue. No migration across CPUs.
  5. O(n) idle wakeup scan — wakeup_contexts() linearly scans the entire IDLE_CONTEXTS VecDeque on every tick (every ~2.25ms effective).
  6. Single global context switch lock — arch::CONTEXT_SWITCH_LOCK serializes all CPU context switches on many-core systems.
  7. No NUMA awareness — memory locality is not considered during scheduling.
  8. No timeslice scaling — all contexts get the same 3-tick quantum regardless of priority (priority only affects how often they're picked, not how long they run).
  9. Large fixed iteration limit — 5000 iterations per schedule attempt can cause latency spikes under heavy load.

2.2 Context/Thread Model

File: recipes/core/kernel/source/src/context/context.rs

pub struct Context {
    pub prio: usize,                 // Priority (0-39, default 20)
    pub status: Status,              // Runnable / Blocked / HardBlocked / Dead
    pub running: bool,               // Currently on a CPU
    pub cpu_id: Option<LogicalCpuId>,// Which CPU this context is on
    pub sched_affinity: LogicalCpuSet, // Allowed CPU set
    pub cpu_time: u128,              // Accumulated CPU time (nanoseconds)
    pub switch_time: u128,           // Last switch-in time
    pub wake: Option<u128>,         // Wake timestamp for timed sleeps
    pub preempt_locks: usize,       // Preemption disable counter
    pub kfx: AlignedBox,            // SIMD/FPU save area
    pub addr_space: Option<Arc<AddrSpaceWrapper>>, // Can be shared (threads)
    pub files: Arc<LockedFdTbl>,    // Can be shared (same process threads)
    pub owner_proc_id: Option<NonZeroUsize>,  // Parent process
    pub name: ArrayString<32>,      // Human-readable name
    // Credentials:
    pub euid: u32, pub egid: u32, pub pid: usize,
    pub groups: Vec<u32>,           // Supplementary groups
}

Thread creation flow:

pthread_create() 
  → relibc::pthread::create() 
    → mmap() for stack
    → Tcb::new() for TLS
    → stack setup with entry shim
    → Sys::rlct_clone(stack, os_specific) 
      → redox_rt::clone() 
        → proc: scheme -> kernel clone
          → Context::new() (same owner_proc_id, shared addr_space)
          → context::spawn() (pushed to IDLE_CONTEXTS)

Key architectural points:

  • Threads share the same addr_space: Arc<AddrSpaceWrapper> (same page tables)
  • Threads share files: Arc<LockedFdTbl> (same FD table)
  • Thread ownership via owner_proc_id — but no formal thread group concept
  • No distinction between process and thread at kernel level — all are Contexts
  • pid is set once, no tgid/tid distinction

2.3 Futex Implementation

File: recipes/core/kernel/source/src/syscall/futex.rs

// Global hash table: PhysicalAddress → Vec<FutexEntry>
type FutexList = HashMap<PhysicalAddress, Vec<FutexEntry>>;
static FUTEXES: Mutex<L1, FutexList> = ...;

pub struct FutexEntry {
    target_virtaddr: VirtualAddress,
    context_lock: Arc<ContextLock>,
    addr_space: Weak<AddrSpaceWrapper>,  // For CoW safety
}

Supported operations:

| Op | Status | Notes |
| --- | --- | --- |
| FUTEX_WAIT (32-bit) | ✅ | Validates alignment (4-byte), checks value, blocks |
| FUTEX_WAIT64 (64-bit) | ✅ | x86_64 only, checks alignment (8-byte) |
| FUTEX_WAKE | ✅ | Wakes up to val waiters, O(n) scan by virtual address matching |

NOT supported (critical gaps):

| Op | Impact |
| --- | --- |
| FUTEX_REQUEUE | Cannot move waiters between futexes — needed by condvar broadcast |
| FUTEX_CMP_REQUEUE | Cannot atomically compare-and-requeue — race condition risk |
| FUTEX_WAKE_OP | Cannot do atomic op + wake — needed by glibc mutex fast path |
| FUTEX_LOCK_PI | No priority inheritance — PTHREAD_PRIO_INHERIT is a stub |
| FUTEX_TRYLOCK_PI | No trylock with PI |
| FUTEX_UNLOCK_PI | No unlock with PI |
| FUTEX_CMP_REQUEUE_PI | No requeue with PI |
| FUTEX_WAIT_BITSET | No bitset wait — needed for pselect/ppoll optimization |
| FUTEX_WAKE_BITSET | No bitset wake |
| FUTEX_WAIT_MULTIPLE | As noted in code TODO, not implemented |
| FUTEX_PRIVATE flag | Conceptual TODO in code comment — "implement fully in userspace" |
Performance concerns:

  1. Global FUTEXES mutex — all futex operations on all CPUs contend on a single L1 lock
  2. O(n) wake scan — FUTEX_WAKE iterates all entries for a physical address to match by virtual address
  3. Full HashMap entry removal — on wake, entry is swap_remove'd; on last waiter, the entire HashMap entry is removed (churn)
  4. No per-process futex isolation — all futexes share the same global table, even process-private ones
  5. No wait-multiple — waking multiple independent futexes requires multiple syscalls

2.4 relibc pthread Completeness

Files: src/pthread/mod.rs, src/header/pthread/*.rs, src/header/sched/mod.rs

| API Surface | Status | Notes |
| --- | --- | --- |
| pthread_create / pthread_join / pthread_detach | Full | Stack via mmap, TLS init, waitval for join |
| pthread_mutex_* (normal, recursive, errorcheck) | Full | Internal implementation in src/sync/ |
| pthread_cond_* | Full | Condition variables present |
| pthread_rwlock_* | Full | Read-write locks present |
| pthread_barrier_* | Full | Barriers present |
| pthread_spin_* | Full | Spinlocks present |
| pthread_key_* / TLS | Full | Thread-local storage with destructors |
| pthread_once | Full | call_once pattern |
| pthread_cancel / pthread_setcancelstate / pthread_setcanceltype | Full | Deferred + async cancellation via RT signal |
| pthread_attr_* (init/destroy/get/set) | Full | All attribute accessors implemented |
| pthread_getattr_np | Partial | Stack base/size returned; other attrs default |
| pthread_setname_np / pthread_getname_np | Delivered | Kernel proc: Name handle + relibc wrapper |
| pthread_attr_setschedpolicy | 🚧 Accepts value, kernel ignores | Kernel pays no attention to policy |
| pthread_attr_setschedparam | 🚧 Accepts value, kernel ignores | sched_priority stored but unused |
| pthread_setschedparam | 🚧 No-op | set_sched_param() — TODO comment |
| pthread_setschedprio | 🚧 No-op | set_sched_priority() — TODO comment |
| pthread_mutexattr_setprotocol | 🚧 Stub | PTHREAD_PRIO_INHERIT accepted but no-op |
| pthread_mutexattr_setrobust | 🚧 Stub | PTHREAD_MUTEX_ROBUST accepted but no-op |
| pthread_mutexattr_setpshared | 🚧 Partial | PROCESS_SHARED constant exists; futex supports cross-AS |
| pthread_getcpuclockid | 🚧 ENOENT | get_cpu_clkid() returns ENOENT |
| pthread_kill | ⚠️ Failing | Failing tests (child/invalid/self) — race condition noted at signal/mod.rs:178 |
| pthread_atfork | Empty stubs | Fork hooks registered, but handlers are no-ops — fork is NOT thread-safe |
| pthread_sigmask | Full | Via sigprocmask |

sched.h functions:

| Function | Status |
| --- | --- |
| sched_yield | Via Sys::sched_yield() |
| sched_get_priority_max | 🚧 todo!() |
| sched_get_priority_min | 🚧 todo!() |
| sched_getparam | 🚧 todo!() |
| sched_setparam | 🚧 todo!() |
| sched_setscheduler | 🚧 todo!() |
| sched_rr_get_interval | 🚧 todo!() |

2.5 IPC Primitives Relevant to Multithreading

From KERNEL-IPC-CREDENTIAL-PLAN.md and direct code review:

| Primitive | Kernel Support | Threading Impact |
| --- | --- | --- |
| Futex | WAIT/WAKE only | Critical — base primitive for all userspace sync |
| Shared memory (shm/mmap MAP_SHARED) | Via memory scheme | Required for PTHREAD_PROCESS_SHARED |
| Signals (per-thread) | Via proc: scheme | Thread cancellation, SIGEV_THREAD |
| Pipe | Kernel pipe: scheme | Thread communication |
| eventfd/signalfd/timerfd | Recipe-applied | Async I/O notification |
| SysV sem/shm | Recipe-activated (2026-04-29) | Qt QSystemSemaphore |
| POSIX msg queues | Missing | Low priority for desktop |
| SysV msg queues | Missing | Low priority for desktop |
3. Critical Gaps and Blockers

3.1 Priority Gaps (Blocking Desktop Responsiveness)

| # | Gap | Impact | Blocked Consumer |
| --- | --- | --- | --- |
| G1 | No SCHED_RR/SCHED_FIFO | All threads treated equally; input/audio threads can't get priority | KWin input thread, PulseAudio |
| G2 | No dynamic priority | CPU-bound threads aren't penalized; I/O-bound threads aren't boosted | Desktop compositor under load |
| G3 | No PI futexes | Priority inversion: low-priority thread holding mutex blocks high-priority waiter | KWin compositor lock, Qt mutexes |
| G4 | No pthread_setschedparam | Applications can't request scheduling policy changes | All desktop apps |
| G5 | No timeslice differentiation | High-priority threads get same quantum as low-priority | Poor latency for foreground tasks |

3.2 Scalability Gaps (Blocking Many-Core Performance)

| # | Gap | Impact |
| --- | --- | --- |
| G6 | No work stealing | CPUs go idle while work exists on other CPUs |
| G7 | No load balancing | New threads stay on creator CPU; imbalance builds over time |
| G8 | Global context switch lock | Serialization bottleneck beyond ~8 cores |
| G9 | Global futex mutex | All cores contend on single L1 lock for futex ops |
| G10 | O(n) idle wake scan | Linear scan proportional to total sleeping threads |
| G11 | No NUMA awareness | Cross-node memory access penalty on multi-socket systems |

3.3 Correctness Gaps (Blocking Robust Applications)

| # | Gap | Impact |
| --- | --- | --- |
| G12 | No robust mutexes | Thread death while holding mutex → permanent deadlock |
| G13 | No FUTEX_REQUEUE | Condvar broadcast wakes all waiters → thundering herd |
| G14 | No thread groups (tgid) | kill(pid, sig) can't target a process; getpid() per thread context |
| G15 | Static-only sched_affinity | No userspace CPU pinning API |
| G16 | No setpriority/getpriority | POSIX nice values not wired to kernel priority |
| G17 | pthread barriers hang on SMP | check.sh runs -smp 1 to work around barrier/once hang on multi-core QEMU — blocks KWin GPU barrier sync |
| G18 | pthread_kill race condition | All four pthread_kill tests (child/invalid/self/kill0) are failing — thread-targeted signal delivery unreliable |
| G19 | fork() thread-unsafe | pthread_atfork handlers are empty no-ops; child inherits locked mutexes from parent |
| G20 | Linux aarch64 rlct_clone stub | todo!("rlct_clone not implemented for aarch64 yet") — blocks aarch64 builds |

4. Implementation Plan

Phase S1: Scheduler Observability and Metrics (Week 1-2)

Goal: Add instrumentation to measure and understand scheduling behavior before optimizing.

S1.1 — Per-context scheduling statistics

Add to Context struct:

pub struct Context {
    // NEW scheduling statistics:
    pub sched_run_count: u64,        // Times this context was scheduled
    pub sched_wait_time: u128,       // Total time spent waiting (accumulated)
    pub sched_last_wake: u128,       // Timestamp of last unblock
    pub sched_migrations: u32,       // Times migrated between CPUs
    pub sched_preemptions: u32,      // Times preempted
    pub sched_voluntary_switch: u32, // Times yielded/blocked voluntarily
}

Files: context/context.rs — add fields, initialize in Context::new(), update in switch()

S1.2 — Per-CPU scheduler metrics

Add to cpu_stats.rs:

pub struct CpuStats {
    // Existing: user, nice, kernel, idle, irq
    // NEW:
    pub sched_scans: AtomicU64,      // number of select_next_context calls
    pub sched_empty_scans: AtomicU64, // scans that found no runnable context
    pub sched_steals: AtomicU64,     // work stolen from other CPUs (future)
    pub sched_ipi_wakeups: AtomicU64, // wakeups via IPI
    pub sched_max_queue_depth: AtomicU64, // maximum queue depth observed
}

S1.3 — /scheme/sys/sched debug interface

Expose scheduler metrics via a new kernel scheme path:

scheme:sys/sched/runqueues  — per-CPU run queue depths
scheme:sys/sched/top        — top-N contexts by recent CPU time
scheme:sys/sched/context/{id} — per-context scheduling stats

This enables redbear-info or a new redbear-sched tool for runtime diagnostics.
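
A hypothetical consumer sketch, assuming the node reads as plain text at the path below (neither the path nor the output format is final until S1.3 lands):

// redbear-sched sketch: dump per-CPU run queue depths
use std::fs;

fn main() -> std::io::Result<()> {
    let runqueues = fs::read_to_string("/scheme/sys/sched/runqueues")?;
    print!("{runqueues}");
    Ok(())
}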

S1.4 — relibc sched_getscheduler() baseline

Wire sched_getscheduler() to return SCHED_OTHER (the current DWRR is closest to SCHED_OTHER):

// relibc/src/header/sched/mod.rs
pub extern "C" fn sched_getscheduler(pid: pid_t) -> c_int {
    // For now: all processes use SCHED_OTHER (DWRR)
    SCHED_OTHER
}

Patch: local/patches/relibc/P5-sched-observe.patch


Phase S2: Real-Time Scheduling Support (Week 2-4)

Goal: Add SCHED_FIFO and SCHED_RR scheduling classes to the kernel, and wire relibc sched_setscheduler().

S2.1 — Scheduling policy in Context

Add to Context:

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SchedPolicy {
    Other,       // DWRR (current default)
    Fifo,        // Strict priority, no preemption within same priority
    RoundRobin,  // Strict priority, round-robin within same priority
    // Future:
    // Batch,    // Throughput-optimized, lower priority than Other
    // Idle,     // Only runs when absolutely nothing else is runnable
}

pub struct Context {
    pub sched_policy: SchedPolicy,  // NEW
    pub sched_rt_priority: u8,      // NEW: 0-99 RT priority
    
    // Renamed: prio → sched_dynamic_prio (for SCHED_OTHER)
    pub sched_dynamic_prio: usize,
    pub sched_static_prio: usize,   // NEW: base priority, unmodified by heuristics
}

Initialization: Default sched_policy = SchedPolicy::Other, sched_rt_priority = 0.

S2.2 — Priority mapping

RT priority 99 → kernel prio 0  (highest)
RT priority 98 → kernel prio 1
...
RT priority 0  → kernel prio 39 (lowest RT, still above SCHED_OTHER)

SCHED_OTHER:
  nice -20 → kernel prio 0  (still below RT 0)
  nice   0 → kernel prio 20 (default)
  nice +19 → kernel prio 39

SCHED_FIFO within same RT priority: no preemption (runs until blocks)
SCHED_RR within same RT priority: round-robin with configurable quantum
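
A minimal sketch of that mapping (function name assumed; the 100 RT levels compress linearly onto the 40 kernel levels):

// RT priority 99 → kernel prio 0 (highest); RT priority 0 → kernel prio 39
fn rt_to_kernel_prio(rt_priority: u8) -> usize {
    debug_assert!(rt_priority <= 99);
    39 - (rt_priority as usize * 39) / 99
}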

S2.3 — Scheduler dispatch by policy

Modify select_next_context() to prioritize:

  1. SCHED_FIFO contexts (highest RT priority first, no preemption per priority)
  2. SCHED_RR contexts (highest RT priority first, round-robin per priority)
  3. SCHED_OTHER contexts (existing DWRR)

fn select_next_context(...) -> ... {
    // PASS 1: SCHED_FIFO — first runnable at highest priority wins
    for prio in 0..40 {
        if let Some(fifo_ctx) = take_first_runnable_of_policy(
            prio, SchedPolicy::Fifo, &mut contexts_list
        ) {
            return Ok(Some(fifo_ctx));
        }
    }
    
    // PASS 2: SCHED_RR — round-robin within priority
    for prio in 0..40 {
        if let Some(rr_ctx) = take_next_rr_of_policy(
            prio, &mut contexts_list, &mut percpu.rr_position[prio]
        ) {
            return Ok(Some(rr_ctx));
        }
    }
    
    // PASS 3: SCHED_OTHER — existing DWRR (unchanged)
    existing_dwrr_logic(...)
}

S2.4 — SCHED_RR timeslice configuration

Add per-context timeslice for SCHED_RR:

pub struct Context {
    pub sched_rr_quantum: u128,  // nanoseconds, default 100ms
}

Override the 3-tick quantum for SCHED_RR contexts: track ticks consumed, preempt at quantum.

S2.5 — syscall interface for policy changes

Add kernel syscall or extend proc: scheme:

proc: scheme command: SetSchedPolicy(pid, policy, rt_priority)
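
The kernel-side handler could take roughly this shape (request and helper names are assumptions, not the final proc: scheme ABI):

// Hypothetical proc: scheme request for S2.5
enum ProcRequest {
    SetSchedPolicy { pid: usize, policy: SchedPolicy, rt_priority: u8 },
}

fn handle_sched_request(req: ProcRequest) -> Result<()> {
    let ProcRequest::SetSchedPolicy { pid, policy, rt_priority } = req;
    let context = context::lookup(pid).ok_or(Error::new(ESRCH))?;
    let mut ctx = context.write();
    ctx.sched_policy = policy;
    ctx.sched_rt_priority = rt_priority;
    Ok(())
}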

S2.6 — Wire relibc sched_setscheduler()

// relibc/src/header/sched/mod.rs
pub extern "C" fn sched_setscheduler(
    pid: pid_t, policy: c_int, param: *const sched_param,
) -> c_int {
    let prio = unsafe { (*param).sched_priority };
    let kernel_policy = match policy {
        SCHED_FIFO => SchedPolicyRequest::Fifo,
        SCHED_RR   => SchedPolicyRequest::RoundRobin,
        SCHED_OTHER => SchedPolicyRequest::Other,
        _ => return set_errno(EINVAL),
    };
    
    // Send to kernel via proc: scheme
    Sys::set_sched_policy(pid, kernel_policy, prio)
}

Patches:

  • local/patches/kernel/P5-sched-policy.patch — Context fields + sched dispatch
  • local/patches/kernel/P5-sched-policy-proc.patch — proc: scheme SetSchedPolicy
  • local/patches/relibc/P5-sched-setscheduler.patch — wire through scheme
  • local/patches/relibc/P5-sched-getscheduler.patch — return current policy
  • local/patches/relibc/P5-sched-priority.patch — sched_get/setparam

Phase S3: Load Balancing and Work Stealing (Week 4-6)

Status: COMPLETE (2026-04-30) — P3.1 PerCpuSched struct + P3.2 per-CPU wiring + P3.3 work stealing + P3.4 initial placement (least-loaded CPU) + P3.5 periodic load balancing all implemented.

Goal: Distribute runnable contexts across CPUs to maximize utilization.

S3.1 — Per-CPU run queue lock elimination

Replace the global RUN_CONTEXTS: Mutex<L1, RunContextData> with per-CPU run queues:

// In PercpuBlock:
pub struct PerCpuSched {
    pub run_queues: [VecDeque<WeakContextRef>; 40],
    pub run_queues_lock: SpinLock,  // per-CPU, low contention
    pub balance: [usize; 40],
    pub last_queue: usize,
    pub idle_context: Arc<ContextLock>,
}

This eliminates the global L1 mutex bottleneck for dequeue operations.

S3.2 — Idle CPU work stealing

When select_next_context() finds no runnable context on the local CPU:

  1. Pick a victim CPU (round-robin or random)
  2. Lock victim's run queues
  3. Dequeue the highest-priority runnable context
  4. Return it for scheduling

fn steal_work(percpu: &PercpuBlock, cpu_id: LogicalCpuId) -> Option<ArcContextLockWriteGuard> {
    for victim_offset in 1..cpu_count() {
        let victim_id = (cpu_id + victim_offset) % cpu_count();
        let victim_percpu = percpu_for(victim_id);
        
        // Try to steal from highest priority queues first
        for prio in 0..40 {
            if let Some(ctx) = victim_percpu.dequeue_runnable(prio) {
                percpu.stats.sched_steals.fetch_add(1, Ordering::Relaxed);
                return Some(ctx);
            }
        }
    }
    None
}

S3.3 — Initial placement (fork/exec balance)

When creating a new context, instead of always going to the creating CPU's idle queue:

fn place_new_context(ctx: &mut Context) -> LogicalCpuId {
    // Pick the CPU with the shortest total run queue
    let target = cpus()
        .min_by_key(|cpu| cpu.total_runnable_contexts())
        .unwrap_or(crate::cpu_id());
    
    ctx.sched_affinity = LogicalCpuSet::single(target);
    target
}

S3.4 — Periodic load balancing

Add a periodic balancing trigger (e.g., every 100ms or when queue depth difference exceeds threshold):

fn balance_load() {
    let avg_depth = average_runnable_per_cpu();
    // Overloaded = more than 25% above the average queue depth
    // (integer math, since queue depths are usize)
    let threshold = avg_depth + avg_depth / 4;
    for cpu in overloaded_cpus(threshold) {
        let target = most_idle_cpu();
        // Move the surplus above the average to the most idle CPU
        migrate_contexts(cpu, target, cpu.total_runnable().saturating_sub(avg_depth));
    }
}

Patches:

  • local/patches/kernel/P6-percpu-runqueues.patch — per-CPU run queues (infrastructure)

Phase S4: Futex Enhancements (Week 6-9)

Status: COMPLETE (2026-04-30) — S4.1 futex sharding (64-shard), S4.2 FUTEX_REQUEUE, S4.3 PI futex, S4.4 robust futex, vruntime tracking, minimum-vruntime selection all implemented.

Goal: Add PI, requeue, and per-futex locking to support robust desktop mutex performance.

S4.1 — Per-futex locking (reduce global contention)

Replace the single FUTEXES: Mutex<L1, FutexList> with a sharded hash table:

const FUTEX_SHARDS: usize = 64; // or scale with CPU count
static FUTEXES: [Mutex<FutexList>; FUTEX_SHARDS] = ...;

fn futex_shard(phys: PhysicalAddress) -> usize {
    phys.data() as usize % FUTEX_SHARDS
}

S4.2 — FUTEX_REQUEUE and FUTEX_CMP_REQUEUE

fn futex_requeue(
    addr1: PhysicalAddress,  // source futex
    addr2: PhysicalAddress,  // target futex
    val: usize,              // max to requeue
    val2: usize,             // expected value (for CMP_REQUEUE)
    cmp: bool,               // whether to compare first
) -> Result<usize> {
    // Atomically move up to `val` waiters from addr1's wait queue to addr2's
    // If cmp is true, only proceed if *addr1 == val2
}

This is critical for condition variable performance — without it, pthread_cond_broadcast causes a thundering herd where every waiter wakes, rechecks, and most re-block.
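
For illustration, a broadcast built on CMP_REQUEUE wakes one waiter and parks the rest on the mutex word, so they are released one at a time as the mutex changes hands (the shim below is an assumption, not the relibc API):

use core::sync::atomic::{AtomicU32, Ordering};

// Assumed userspace shim over the proposed FUTEX_CMP_REQUEUE syscall
fn futex_cmp_requeue(
    from: &AtomicU32, wake: usize, requeue: usize, to: &AtomicU32, expected: u32,
) -> Result<usize, i32> {
    unimplemented!("syscall wrapper elided in this sketch")
}

fn cond_broadcast(cond_seq: &AtomicU32, mutex_word: &AtomicU32) -> Result<usize, i32> {
    let seq = cond_seq.load(Ordering::Relaxed);
    // Wake one waiter; requeue all others onto the mutex word, but only
    // if the sequence word still equals `seq` (otherwise EAGAIN)
    futex_cmp_requeue(cond_seq, 1, usize::MAX, mutex_word, seq)
}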

S4.3 — PI Futexes (FUTEX_LOCK_PI / FUTEX_UNLOCK_PI / FUTEX_TRYLOCK_PI / FUTEX_CMP_REQUEUE_PI)

Priority inheritance for futexes:

pub struct PiState {
    owner: Option<Arc<ContextLock>>,
    waiters: Vec<(Arc<ContextLock>, u32)>, // (context, original_priority)
}

// When a high-priority context blocks on a PI futex held by a low-priority context:
fn pi_boost(owner: &mut Context, waiter_prio: usize) {
    if waiter_prio < owner.sched_dynamic_prio {
        owner.sched_dynamic_prio = waiter_prio;
        owner.pi_boosted = true;
    }
}

Critical path: KWin compositor lock. Without PI, a low-priority background thread holding a mutex that the compositor thread needs can block rendering for an unbounded time.
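
The boost must also be undone on unlock, or the owner keeps the inherited priority forever. A companion sketch to pi_boost, using the PiState fields above (recall: lower kernel prio number = higher priority):

// On FUTEX_UNLOCK_PI: fall back to the static base priority unless another
// queued waiter still justifies a boost
fn pi_unboost(owner: &mut Context, pi: &PiState) {
    let top_waiter = pi.waiters.iter().map(|(_, prio)| *prio as usize).min();
    match top_waiter {
        Some(p) if p < owner.sched_static_prio => owner.sched_dynamic_prio = p,
        _ => {
            owner.sched_dynamic_prio = owner.sched_static_prio;
            owner.pi_boosted = false;
        }
    }
}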

S4.4 — Robust Futexes

Mark futex waiters in a robust_list so the kernel can unlock them on thread death:

pub struct RobustListEntry {
    futex_addr: usize,
    futex_len: usize,
    // List is per-thread, registered via set_robust_list syscall
}

On exit_thread():

fn wake_robust_futexes(context: &Context) {
    for entry in &context.robust_list {
        // Set FUTEX_OWNER_DIED bit
        // Wake one waiter with EOWNERDEAD
    }
}

Patches:

  • local/patches/kernel/P6-futex-sharding.patch — futex lock sharding (delivered)
  • (PI futex, requeue, robust futex deferred)

Phase S5: Dynamic Priority and Thread Management (Week 9-11)

Status: COMPLETE (2026-04-30) — S5.1 vruntime + S5.2 setpriority/getpriority + S5.3 pthread_setaffinity_np + S5.4 pthread_setname_np + pthread_setschedparam (Redox) all implemented.

Goal: Add I/O-vs-CPU heuristics, CPU affinity API, and thread naming.

S5.1 — Dynamic priority adjustment (SCHED_OTHER)

Implement a simplified CFS-style virtual runtime tracking:

pub struct Context {
    pub vruntime: u128,  // Virtual runtime (weighted by priority)
}

// On context switch OUT:
prev_context.vruntime += actual_runtime * SCHED_PRIO_TO_WEIGHT[default_prio] 
                       / SCHED_PRIO_TO_WEIGHT[prev_context.sched_static_prio];

// On select_next_context for SCHED_OTHER:
// Pick context with lowest vruntime instead of DWRR deficit tracking

This automatically penalizes CPU-bound threads (their vruntime grows faster) and favors I/O-bound threads (they sleep, vruntime stays low).
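
A worked example of the weighting, using the §2.1 weight table (weight 1024 at the default priority 20 is the unit):

// vruntime advances at weight[20] / weight[static_prio] times real time
fn vruntime_delta(actual_runtime_ns: u128, weight: u128) -> u128 {
    const UNIT_WEIGHT: u128 = 1024; // SCHED_PRIO_TO_WEIGHT[20]
    actual_runtime_ns * UNIT_WEIGHT / weight
}

fn main() {
    let ten_ms: u128 = 10_000_000; // ns
    println!("{}", vruntime_delta(ten_ms, 88761)); // prio 0:  ≈ 0.115 ms charged
    println!("{}", vruntime_delta(ten_ms, 1024));  // prio 20: exactly 10 ms
    println!("{}", vruntime_delta(ten_ms, 15));    // prio 39: ≈ 682.7 ms charged
}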

S5.2 — POSIX nice values

Map nice(-20..+19) to static priorities:

fn nice_to_static_prio(nice: i32) -> usize {
    // nice -20 → kernel prio 0 (SCHED_OTHER range)
    // nice   0 → kernel prio 20
    // nice +19 → kernel prio 39
    // Clamp before casting so an out-of-range nice can't wrap negative
    (nice + 20).clamp(0, 39) as usize
}

// Wire setpriority/getpriority to modify sched_static_prio

S5.3 — CPU affinity API

Add to proc: scheme:

proc: scheme command: SetAffinity(pid, affinity_mask: u64)
proc: scheme command: GetAffinity(pid) → u64

Wire in relibc:

pub extern "C" fn pthread_setaffinity_np(
    thread: pthread_t, cpusetsize: size_t, cpuset: *const cpu_set_t,
) -> c_int {
    let mask = unsafe { read_cpu_set(cpuset, cpusetsize) };
    Sys::set_cpu_affinity(tid, mask)
}

S5.4 — Thread naming API

The kernel Context.name field already exists (32-char ArrayString). Wire it:

// proc: scheme command: SetName(pid, name)
// relibc:
pub extern "C" fn pthread_setname_np(thread: pthread_t, name: *const c_char) -> c_int {
    let name = unsafe { CStr::from_ptr(name) };
    Sys::set_thread_name(thread.os_tid, name)
}

Patches:

  • local/patches/kernel/P6-vruntime-context.patch — vruntime field + initialization
  • local/patches/kernel/P6-vruntime-switch.patch — weighted update + min-vruntime selection
  • local/patches/kernel/P7-cache-affine-context.patch — cache-affine scheduling (last_cpu)
  • local/patches/kernel/P7-cache-affine-switch.patch — cache-affine vruntime bonus
  • local/patches/kernel/P7-proc-setpriority.patch — setpriority proc handle
  • local/patches/kernel/P7-proc-setname.patch — thread naming proc handle
  • local/patches/relibc/P7-setpriority.patch — setpriority/getpriority
  • local/patches/relibc/P7-pthread-affinity.patch — pthread_setaffinity_np
  • local/patches/relibc/P7-pthread-setname.patch — pthread_setname_np

Phase S6: NUMA and Cache-Affine Scheduling (Week 11-13)

Status: DELIVERED (2026-04-30) — S6.3 cache-affine scheduling + S6.1 NUMA topology kernel hints implemented. NUMA discovery (SRAT/SLIT parsing) is userspace responsibility (numad daemon via /scheme/acpi/). Kernel stores lightweight NumaTopology for O(1) scheduling lookups. Full userspace numad daemon is follow-up work.

Goal: Optimize for multi-socket systems by keeping related threads near their memory.

S6.1 — NUMA topology discovery

Parse ACPI SRAT/SLIT tables (already available in ACPI infrastructure):

pub struct NumaTopology {
    nodes: Vec<NumaNode>,
    distances: Vec<Vec<u8>>,  // SLIT inter-node distances
}

pub struct NumaNode {
    id: u8,
    cpus: LogicalCpuSet,
    memory: PhysicalMemoryRange,
}

S6.2 — NUMA-aware initial placement

When creating a new context:

  1. If parent thread has sched_affinity, prefer CPUs in the same NUMA node
  2. Otherwise, pick the NUMA node with the most free memory
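
A minimal sketch of that placement policy (helper names assumed; NumaTopology as defined in S6.1):

fn numa_place(parent: &Context, topo: &NumaTopology) -> LogicalCpuId {
    // 1. Keep children near the parent's memory if it is pinned to one node
    if let Some(node) = topo.node_containing(&parent.sched_affinity) {
        return least_loaded_cpu_in(&node.cpus);
    }
    // 2. Otherwise spread to the node with the most free memory
    least_loaded_cpu_in(&topo.node_with_most_free_memory().cpus)
}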

S6.3 — Cache-affine scheduling

Track the last CPU a context ran on. Prefer to re-schedule on the same CPU to avoid cache migration penalty:

pub struct Context {
    pub sched_last_cpu: LogicalCpuId,  // already tracked via cpu_id before it becomes None
}

In select_next_context():

// When scanning runnable contexts, prefer those whose last_cpu == current_cpu_id
// (hot cache) over those from other CPUs (cold cache)
let hot_ctx = search_for_hot_context(current_cpu, &queues);
let fallback = search_for_cold_context(&queues);
hot_ctx.or(fallback)

Patches:

  • local/patches/kernel/P7-cache-affine-context.patch — cache-affine scheduling (delivered)
  • local/patches/kernel/P7-cache-affine-switch.patch — cache-affine vruntime bonus (delivered)
  • (NUMA SRAT/SLIT parsing deferred)

Phase R1: relibc POSIX Scheduling API Completion (Week 2-4, parallel with S2)

Goal: Fill all todo!() stubs in sched.h and pthread.h scheduling functions.

| Function | Implementation |
| --- | --- |
| sched_get_priority_max(policy) | Return 99 for FIFO/RR, 0 for OTHER |
| sched_get_priority_min(policy) | Return 1 for FIFO/RR, 0 for OTHER |
| sched_getparam(pid, param) | Query kernel for current RT priority |
| sched_setparam(pid, param) | Delegate to sched_setscheduler with current policy |
| sched_getscheduler(pid) | Query kernel for current policy |
| sched_rr_get_interval(pid, tp) | Return SCHED_RR quantum (default 100ms) |
| pthread_setschedparam(thread, policy, param) | Set kernel sched policy via proc: scheme |
| pthread_getschedparam(thread, policy, param) | Get kernel sched policy |
| pthread_setschedprio(thread, prio) | Set dynamic priority within current policy |
| pthread_getcpuclockid(thread, clock_id) | Return CPU-time clock for thread |
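
The two static lookups are trivial; a sketch using the values from the table above (constants assumed Linux-compatible, errors via the set_errno helper used in S2.6):

pub extern "C" fn sched_get_priority_max(policy: c_int) -> c_int {
    match policy {
        SCHED_FIFO | SCHED_RR => 99,
        SCHED_OTHER => 0,
        _ => set_errno(EINVAL),
    }
}

pub extern "C" fn sched_get_priority_min(policy: c_int) -> c_int {
    match policy {
        SCHED_FIFO | SCHED_RR => 1,
        SCHED_OTHER => 0,
        _ => set_errno(EINVAL),
    }
}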

Patches: All in local/patches/relibc/P5-sched-complete.patch


Phase R2: Robust and PI Mutex Support (Week 5-9, parallel with S4)

Goal: Full POSIX mutex robustness and priority inheritance.

R2.1 — PI mutex protocol

// relibc/src/sync/pthread_mutex.rs
pub struct PthreadMutex {
    futex: AtomicU32,
    owner: AtomicUsize,     // os_tid of current owner
    pi_waiters: Mutex<Vec<(OsTid, u32)>>, // waiters with requested priority
    flags: AtomicU32,       // PTHREAD_PRIO_INHERIT, PTHREAD_MUTEX_ROBUST
}

// Lock with PI:
fn lock_pi(&self) -> Result<(), Errno> {
    loop {
        match futex::lock_pi(&self.futex) {
            Ok(()) => {
                self.owner.store(current_tid(), Ordering::Release);
                return Ok(());
            }
            Err(EAGAIN) => continue,
            Err(err) => return Err(err),
        }
    }
}

R2.2 — Robust mutex protocol

pub struct RobustList {
    head: *mut RobustListHead,
}

pub struct RobustListHead {
    list: RobustList,
    futex_offset: isize,
    pending: *mut RobustListHead,
}

// On thread exit:
fn handle_robust_list(thread: &Pthread) {
    for entry in thread.robust_list.iter() {
        let base = entry as *const RobustListHead as usize;
        let futex_addr = base.wrapping_add(entry.futex_offset as usize) as *const AtomicU32;
        unsafe {
            // Set FUTEX_OWNER_DIED so the next locker observes EOWNERDEAD
            (*futex_addr).fetch_or(FUTEX_OWNER_DIED, Ordering::Release);
        }
        // Wake one waiter with EOWNERDEAD
        futex::wake(futex_addr, 1);
    }
}

Phase R3: Thread Groups and Process Identity (Week 10-12)

Goal: Proper tgid/pid distinction, kill(pid, 0) process targeting.

R3.1 — Kernel thread group concept

pub struct Context {
    pub tgid: usize,  // Thread Group ID (= pid for main thread)
    pub tid: usize,   // Thread ID (unique per thread)
}

  • On clone(CLONE_THREAD): child gets same tgid as parent, new tid
  • On fork: child gets new tgid = child's tid
  • getpid() returns tgid
  • gettid() returns tid
  • kill(tgid, sig) delivers signal to all threads in thread group
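
Clone-time assignment under those rules might look like this (flag and helper names assumed):

fn assign_thread_ids(child: &mut Context, parent: &Context, flags: CloneFlags) {
    child.tid = next_tid(); // unique per thread
    child.tgid = if flags.contains(CloneFlags::THREAD) {
        parent.tgid // CLONE_THREAD: join the parent's thread group
    } else {
        child.tid   // fork: lead a new thread group
    };
}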

R3.2 — Thread group signal delivery

fn deliver_signal_to_thread_group(tgid: usize, sig: Signal) {
    for context in contexts_in_thread_group(tgid) {
        // Pick a thread that hasn't blocked this signal
        if !context.sig_blocked(sig) {
            context.deliver_signal(sig);
            break;
        }
    }
}

Patches:

  • local/patches/kernel/P5-tgid.patch — thread group ID kernel support
  • local/patches/kernel/P5-tgid-signal.patch — process-targeted signal delivery
  • local/patches/relibc/P5-gettid.patch — gettid() syscall

5. Dependency Chain

Phase S1 (observability)
    │
    ├──► Phase S2 (real-time scheduling) ────┐
    │       │                                 │
    │       ├──► Phase R1 (POSIX sched API)   │
    │       │                                 │
    │       └──► KWin input thread priority   │
    │                                         │
    ├──► Phase S3 (load balancing) ───────────┤
    │       │                                 │
    │       └──► Mesa worker thread scaling   │
    │                                         │
    ├──► Phase S4 (futex enhancements) ───────┤
    │       │                                 │
    │       ├──► Phase R2 (PI/robust mutex)   │
    │       │                                 │
    │       └──► KWin compositor lock         │
    │                                         │
    ├──► Phase S5 (dynamic prio + affinity) ──┤
    │       │                                 │
    │       └──► Application CPU pinning      │
    │                                         │
    ├──► Phase R3 (thread groups) ────────────┤
    │       │                                 │
    │       └──► process-targeted signals     │
    │                                         │
    └──► Phase S6 (NUMA) ─────────────────────┘
            │
            └──► Multi-socket server performance

Independent work (can run in parallel):

  • S2 (RT scheduling) + R1 (POSIX sched API) — parallel
  • S4 (futex) + R2 (PI/robust mutex) — parallel
  • S3 (load balancing) can start after S1 but independently of S2
  • S6 (NUMA) depends on S3 (per-CPU queues) but not on S4/S5

6. Integration with Existing Plans

| Existing Plan | Relationship |
| --- | --- |
| KERNEL-IPC-CREDENTIAL-PLAN.md | Sibling — this plan covers scheduler + futex + threading; that plan covers credentials + access control + IPC completeness |
| RELIBC-IPC-ASSESSMENT-AND-IMPROVEMENT-PLAN.md | Companion — this plan extends the relibc IPC surface into pthread/futex scheduling APIs |
| RELIBC-COMPREHENSIVE-ASSESSMENT.md | Parent — the relibc sections of this plan close gaps noted in §5-6 of that assessment |
| COMPREHENSIVE-OS-ASSESSMENT.md | Parent — this plan closes §2 kernel gaps for scheduler/scalability |
| CONSOLE-TO-KDE-DESKTOP-PLAN.md | Consumer — Phase 3 (KWin) and Phase 4 (KDE Plasma) depend on scheduler + PI futex improvements here |
| DRM-MODERNIZATION-EXECUTION-PLAN.md | Sibling — GPU worker thread scheduling benefits from load balancing (S3) |
| IRQ-AND-LOWLEVEL-CONTROLLERS-ENHANCEMENT-PLAN.md | Sibling — IRQ latency affects scheduling latency |

7. Patch Governance

All kernel and relibc source changes follow the durability policy (local/AGENTS.md):

local/patches/
├── kernel/
│   (Delivered: P6-* and P7-* patches below. P5-sched-* entries are planned future carriers.)
│   ├── P5-sched-observability.patch       # S1
│   ├── P5-sched-policy.patch              # S2
│   ├── P5-sched-policy-proc.patch         # S2 proc: scheme
│   ├── P6-percpu-runqueues.patch          # S3 (delivered: infrastructure)
│   ├── P6-futex-sharding.patch            # S4 (delivered: sharding)
│   ├── P6-vruntime-context.patch          # S5 (delivered: field + init)
│   ├── P6-vruntime-switch.patch           # S5 (delivered: update + selection)
│   └── (remaining S3-S6 patches deferred)
└── relibc/
    ├── P5-sched-observe.patch             # R1 baseline
    ├── P5-sched-setscheduler.patch        # R1
    ├── P5-sched-getscheduler.patch        # R1
    ├── P5-sched-priority.patch            # R1
    ├── P5-sched-complete.patch            # R1 remaining stubs
    ├── (PI/robust mutex deferred)         # R2
    ├── P7-setpriority.patch               # S5 (delivered)
    ├── P7-pthread-affinity.patch          # S5 (delivered)
    └── P5-gettid.patch                    # R3

8. Validation and Evidence

8.1 Build Evidence

| Check | Command |
| --- | --- |
| Kernel compiles | make r.kernel |
| relibc compiles | make r.relibc |
| Full OS builds | make all CONFIG_NAME=redbear-full |

8.2 Runtime Evidence

| Test | Verification |
| --- | --- |
| sched_getscheduler() returns policy | redbear-info --sched |
| pthread_setschedparam() changes priority | Threaded test binary: test-sched-priority |
| RT thread preempts SCHED_OTHER | Latency test: RT thread wakes within 100μs |
| Work stealing across CPUs | redbear-info --sched shows balanced queue depths |
| PI futex prevents priority inversion | PI test: low-prio holder, high-prio waiter, medium-prio contester |
| Robust mutex recovery after thread kill | Robust test: kill thread holding mutex, verify EOWNERDEAD |
| Thread affinity pinning | taskset-like test: verify thread stays on assigned CPU |
| Load balancing on fork bomb | Spawn 2× CPUs threads, verify even distribution |

8.3 Verification Scripts

local/scripts/test-sched-qemu.sh          # Scheduler metric validation
local/scripts/test-sched-rt-qemu.sh       # Real-time scheduling proof
local/scripts/test-futex-pi-qemu.sh       # PI futex proof
local/scripts/test-futex-robust-qemu.sh   # Robust futex proof
local/scripts/test-sched-balance-qemu.sh  # Load balancing proof (multi-vCPU)

9. Bottom Line

The Redox kernel scheduler is functional but simple — a correct DWRR implementation that works for a lightly loaded system. For the KDE/Wayland desktop with dozens of competing threads (compositor, rendering, I/O, timers, D-Bus, input), it needs:

  1. Real-time scheduling (S2) — for audio and compositor input threads
  2. PI futexes (S4/R2) — to prevent the compositor lock from being inverted by background work
  3. Load balancing (S3) — to use all available cores efficiently
  4. Dynamic priority (S5) — to keep the compositor responsive under CPU load

These four items are the critical path to a responsive desktop. The remaining items (NUMA, thread groups, robust mutexes, affinity API) are important for correctness and server-class workloads but not desktop-blocking.

Total estimated effort: 13 weeks with 1-2 kernel developers, delivering incremental improvements at each phase boundary.