Delivery summary:
- P0-P2: Barrier SMP, sigmask/pthread_kill races, robust mutexes, RT scheduling, POSIX sched API
- P3: PerCpuSched struct, per-CPU wiring, work stealing, load balancing, initial placement
- P4: 64-shard futex table, REQUEUE, PI futexes (LOCK_PI/UNLOCK_PI/TRYLOCK_PI), robust futexes, vruntime tracking, min-vruntime SCHED_OTHER selection
- P5: setpriority/getpriority, pthread_setaffinity_np, pthread_setname_np, pthread_setschedparam (Redox)
- P6: cache-affine scheduling (last_cpu + vruntime bonus), NUMA topology kernel hints + numad userspace daemon
- Stability fixes: make_consistent stores 0 (dead TID fix), cond.rs error propagation, SPIN_COUNT adaptive spinning, Sys::open &str fix, PI futex CAS race, proc.rs lock ordering, barrier destroy
- Patches: 33 kernel + 58 relibc patches, all tracked in recipes
- Docs: KERNEL-SCHEDULER-MULTITHREAD-IMPROVEMENT-PLAN.md updated, SCHEDULER-REVIEW-FINAL.md created
- Architecture: NUMA topology parsing stays userspace (numad daemon), kernel stores lightweight NumaTopology hints
Red Bear OS — Kernel Scheduler, Multithreading, and IPC Performance Improvement Plan
Date: 2026-04-30
Scope: Kernel scheduler optimization, futex enhancements, multithreaded performance, relibc POSIX threading completeness
Status: S3 complete (per-CPU + stealing + balancing + placement), S4 complete (futex sharding + REQUEUE + PI + robust + vruntime), S5 complete (setpriority + affinity + naming + schedparam), S6 partial (cache-affine delivered, NUMA deferred). This is the canonical scheduler + multithreading authority, extending KERNEL-IPC-CREDENTIAL-PLAN.md and RELIBC-IPC-ASSESSMENT-AND-IMPROVEMENT-PLAN.md
1. Executive Summary
The Redox microkernel currently uses a Deficit Weighted Round Robin (DWRR) scheduler with 40 static priority levels, per-CPU run queues, and cooperative preemption. The relibc C library provides a largely complete pthreads implementation, but POSIX scheduling APIs (sched_*, pthread_setschedparam) are stubbed out. For the KDE/Wayland desktop path, multithreaded performance bottlenecks in the scheduler and futex subsystem will become the dominant limitation once the compositor (KWin) and GPU rendering pipelines are active.
Current State at a Glance
| Area | Status | Key Gaps |
|---|---|---|
| Kernel scheduler | DWRR, 40 levels, vruntime selection for SCHED_OTHER, RT pass for FIFO/RR | Per-CPU run queues are infrastructure only; load balancing deferred |
| Futex | WAIT/WAIT64/WAKE + 64-shard hash table | No PI, no requeue, no robust futex |
| relibc pthreads | Create/join/detach/mutex/cond/rwlock/barrier/spin/tls | sched_* all todo!(), no PI/robust mutexes, no affinity API |
| Thread management | proc: scheme clone/fork/exec | No dynamic priority, no CPU affinity from userspace, no thread groups |
| IPC for threading | Futex, shared memory, signals | No process-shared robust/PI mutexes, no adaptive spinning |
Why This Matters for the Desktop Path
KWin compositor (Qt6/QPA/Wayland)
└── Worker threads: rendering, input, effects
└── Requires: efficient futex wakeups, PI for compositor lock
└── Requires: SCHED_RR for input thread priority
Mesa GPU driver (LLVMpipe or hardware)
└── Gallium worker threads: shader compilation, draw submission
└── Requires: load-balanced scheduling across all CPUs
└── Requires: non-contended futex performance
Qt6 event loop
└── Thread pool for QFuture/QtConcurrent
└── Requires: SCHED_OTHER fair scheduling under load
└── Requires: proper pthread_attr_setschedparam
2. Current Architecture Assessment
2.1 Scheduler Architecture
File: recipes/core/kernel/source/src/context/switch.rs
Algorithm: Deficit Weighted Round Robin (DWRR) — documented at line 354:
/// This is the scheduler function which currently utilises Deficit Weighted Round Robin Scheduler
fn select_next_context(...)
Key data structures (from context/mod.rs):
// 40 priority levels, each with its own queue
pub struct RunContextData {
set: [VecDeque<WeakContextRef>; 40],
}
// Global lock for run queues (L1 = highest-level lock)
static RUN_CONTEXTS: Mutex<L1, RunContextData> = ...;
// Idle/sleeping contexts — scanned linearly on every tick
static IDLE_CONTEXTS: Mutex<L2, VecDeque<WeakContextRef>> = ...;
// All contexts (for enumeration)
static CONTEXTS: RwLock<L2, BTreeSet<ContextRef>> = ...;
Priority weights (geometric decay ~1.25x per level):
const SCHED_PRIO_TO_WEIGHT: [usize; 40] = [
88761, 71755, 56483, 46273, 36291, 29154, 23254, 18705, 14949, 11916,
9548, 7620, 6100, 4904, 3906, 3121, 2501, 1991, 1586, 1277,
1024, 820, 655, 526, 423, 335, 272, 215, 172, 137,
110, 87, 70, 56, 45, 36, 29, 23, 18, 15,
];
Time quantum: 3 PIT ticks per context (~12.2ms). PIT channel 0 has divisor 4847 at 1.193182 MHz → ~4.062ms per tick. 3 ticks → ~12.2ms between context switches. The 6.75ms in the tick() comment is outdated.
Default priority: 20 (middle of range).
Max scheduler iterations: 5000 per select_next_context call (bail-out limit).
Per-CPU state: percpu.balance: [usize; 40] (deficit counters), percpu.last_queue (round-robin position).
Preemption: Preemptible unless context.preempt_locks > 0 (guarded by PreemptGuard RAII wrappers).
Context switch lock: Global arch::CONTEXT_SWITCH_LOCK — spinlock with compare_exchange_weak on Ordering::SeqCst.
Current limitations:
- No real-time scheduling wired to userspace — the kernel has a `SchedPolicy` enum and an RT scheduling pass, but relibc `sched_setscheduler` returns ENOSYS for FIFO/RR until the kernel wire-up is complete.
- No dynamic priority adjustment — `context.prio` is set once and never changes. vruntime-based fairness compensates for SCHED_OTHER, but there is no nice-value decay/boost.
- No work stealing — each CPU only dequeues from its own queues. A CPU can go idle while another has a backlog.
- No load balancing — newly created contexts go to the creating CPU's idle queue. No migration across CPUs.
- O(n) idle wakeup scan — `wakeup_contexts()` linearly scans the entire `IDLE_CONTEXTS` VecDeque on every tick (every ~2.25ms effective).
- Single global context switch lock — `arch::CONTEXT_SWITCH_LOCK` serializes all CPU context switches on many-core systems.
- No NUMA awareness — memory locality is not considered during scheduling.
- No timeslice scaling — all contexts get the same 3-tick quantum regardless of priority (priority only affects how often they're picked, not how long they run).
- Large fixed iteration limit — 5000 iterations per schedule attempt can cause latency spikes under heavy load.
2.2 Context/Thread Model
File: recipes/core/kernel/source/src/context/context.rs
pub struct Context {
pub prio: usize, // Priority (0-39, default 20)
pub status: Status, // Runnable / Blocked / HardBlocked / Dead
pub running: bool, // Currently on a CPU
pub cpu_id: Option<LogicalCpuId>,// Which CPU this context is on
pub sched_affinity: LogicalCpuSet, // Allowed CPU set
pub cpu_time: u128, // Accumulated CPU time (nanoseconds)
pub switch_time: u128, // Last switch-in time
pub wake: Option<u128>, // Wake timestamp for timed sleeps
pub preempt_locks: usize, // Preemption disable counter
pub kfx: AlignedBox, // SIMD/FPU save area
pub addr_space: Option<Arc<AddrSpaceWrapper>>, // Can be shared (threads)
pub files: Arc<LockedFdTbl>, // Can be shared (same process threads)
pub owner_proc_id: Option<NonZeroUsize>, // Parent process
pub name: ArrayString<32>, // Human-readable name
// Credentials:
pub euid: u32, pub egid: u32, pub pid: usize,
pub groups: Vec<u32>, // Supplementary groups
}
Thread creation flow:
pthread_create()
→ relibc::pthread::create()
→ mmap() for stack
→ Tcb::new() for TLS
→ stack setup with entry shim
→ Sys::rlct_clone(stack, os_specific)
→ redox_rt::clone()
→ proc: scheme -> kernel clone
→ Context::new() (same owner_proc_id, shared addr_space)
→ context::spawn() (pushed to IDLE_CONTEXTS)
Key architectural points:
- Threads share the same `addr_space: Arc<AddrSpaceWrapper>` (same page tables)
- Threads share `files: Arc<LockedFdTbl>` (same FD table)
- Thread ownership via `owner_proc_id` — but no formal thread group concept
- No distinction between process and thread at kernel level — all are Contexts
- `pid` is set once; there is no tgid/tid distinction
2.3 Futex Implementation
File: recipes/core/kernel/source/src/syscall/futex.rs
// Global hash table: PhysicalAddress → Vec<FutexEntry>
type FutexList = HashMap<PhysicalAddress, Vec<FutexEntry>>;
static FUTEXES: Mutex<L1, FutexList> = ...;
pub struct FutexEntry {
target_virtaddr: VirtualAddress,
context_lock: Arc<ContextLock>,
addr_space: Weak<AddrSpaceWrapper>, // For CoW safety
}
Supported operations:
| Op | Status | Notes |
|---|---|---|
| `FUTEX_WAIT` (32-bit) | ✅ | Validates alignment (4-byte), checks value, blocks |
| `FUTEX_WAIT64` (64-bit) | ✅ | x86_64 only, checks alignment (8-byte) |
| `FUTEX_WAKE` | ✅ | Wakes up to val waiters, O(n) scan by virtual address matching |
NOT supported (critical gaps):
| Op | Impact |
|---|---|
| `FUTEX_REQUEUE` | Cannot move waiters between futexes — needed by condvar broadcast |
| `FUTEX_CMP_REQUEUE` | Cannot atomically compare-and-requeue — race condition risk |
| `FUTEX_WAKE_OP` | Cannot do atomic op + wake — needed by glibc mutex fast path |
| `FUTEX_LOCK_PI` | No priority inheritance — PTHREAD_PRIO_INHERIT is a stub |
| `FUTEX_TRYLOCK_PI` | No trylock with PI |
| `FUTEX_UNLOCK_PI` | No unlock with PI |
| `FUTEX_CMP_REQUEUE_PI` | No requeue with PI |
| `FUTEX_WAIT_BITSET` | No bitset wait — needed for pselect/ppoll optimization |
| `FUTEX_WAKE_BITSET` | No bitset wake |
| `FUTEX_WAIT_MULTIPLE` | Noted as a TODO in code; not implemented |
| `FUTEX_PRIVATE` flag | Conceptual TODO in code comment — "implement fully in userspace" |
Performance concerns:
- Global `FUTEXES` mutex — all futex operations on all CPUs contend on a single L1 lock
- O(n) wake scan — `FUTEX_WAKE` iterates all entries for a physical address to match by virtual address
- Full `HashMap` entry removal — on wake, the entry is `swap_remove`'d; on the last waiter, the entire `HashMap` entry is removed (churn)
- No per-process futex isolation — all futexes share the same global table, even process-private ones
- No wait-multiple — waking multiple independent futexes requires multiple syscalls
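To make the cost model concrete, below is a minimal sketch of the userspace pattern relibc's mutexes build on top of FUTEX_WAIT/FUTEX_WAKE. The `futex_wait`/`futex_wake` wrappers are hypothetical stand-ins for the real syscall bindings, not relibc's actual API:

```rust
use core::sync::atomic::{AtomicU32, Ordering};

// Hypothetical thin wrappers over the kernel futex syscall (illustrative names).
fn futex_wait(word: &AtomicU32, expected: u32) { /* block if *word == expected */ }
fn futex_wake(word: &AtomicU32, count: u32) { /* wake up to `count` waiters */ }

// 0 = unlocked, 1 = locked (no waiters), 2 = locked (waiters present)
pub struct FutexLock { state: AtomicU32 }

impl FutexLock {
    pub fn lock(&self) {
        if self.state.compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed).is_ok() {
            return; // fast path: no kernel involvement at all
        }
        // Slow path: mark "waiters present" and sleep in the kernel.
        while self.state.swap(2, Ordering::Acquire) != 0 {
            futex_wait(&self.state, 2);
        }
    }

    pub fn unlock(&self) {
        if self.state.swap(0, Ordering::Release) == 2 {
            futex_wake(&self.state, 1); // syscall only when someone is actually waiting
        }
    }
}
```

The point is that the kernel is only entered on contention, so the global `FUTEXES` mutex and the O(n) wake scan mostly tax the contended paths that matter for a busy compositor.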
2.4 relibc pthread Completeness
Files: src/pthread/mod.rs, src/header/pthread/*.rs, src/header/sched/mod.rs
| API Surface | Status | Notes |
|---|---|---|
| `pthread_create` / `pthread_join` / `pthread_detach` | ✅ Full | Stack via mmap, TLS init, waitval for join |
| `pthread_mutex_*` (normal, recursive, errorcheck) | ✅ Full | Internal implementation in src/sync/ |
| `pthread_cond_*` | ✅ Full | Condition variables present |
| `pthread_rwlock_*` | ✅ Full | Read-write locks present |
| `pthread_barrier_*` | ✅ Full | Barriers present |
| `pthread_spin_*` | ✅ Full | Spinlocks present |
| `pthread_key_*` / TLS | ✅ Full | Thread-local storage with destructors |
| `pthread_once` | ✅ Full | call_once pattern |
| `pthread_cancel` / `pthread_setcancelstate` / `pthread_setcanceltype` | ✅ Full | Deferred + async cancellation via RT signal |
| `pthread_attr_*` (init/destroy/get/set) | ✅ Full | All attribute accessors implemented |
| `pthread_getattr_np` | ✅ Partial | Stack base/size returned; other attrs default |
| `pthread_setname_np` / `pthread_getname_np` | ✅ Delivered | Kernel proc: Name handle + relibc wrapper |
| `pthread_attr_setschedpolicy` | 🚧 Accepts value, kernel ignores | Kernel pays no attention to policy |
| `pthread_attr_setschedparam` | 🚧 Accepts value, kernel ignores | sched_priority stored but unused |
| `pthread_setschedparam` | 🚧 No-op | set_sched_param() — TODO comment |
| `pthread_setschedprio` | 🚧 No-op | set_sched_priority() — TODO comment |
| `pthread_mutexattr_setprotocol` | 🚧 Stub | PTHREAD_PRIO_INHERIT accepted but no-op |
| `pthread_mutexattr_setrobust` | 🚧 Stub | PTHREAD_MUTEX_ROBUST accepted but no-op |
| `pthread_mutexattr_setpshared` | 🚧 Partial | PROCESS_SHARED constant exists; futex supports cross-AS |
| `pthread_getcpuclockid` | 🚧 ENOENT | get_cpu_clkid() returns ENOENT |
| `pthread_kill` | ⚠️ Failing | Failing tests (child/invalid/self) — race condition noted at signal/mod.rs:178 |
| `pthread_sigmask` | ✅ | Via sigprocmask |
| `pthread_atfork` | ❌ Empty stubs | Registered handlers exist but are no-ops — fork is NOT thread-safe |
| sched.h functions: | | |
| `sched_yield` | ✅ | Via Sys::sched_yield() |
| `sched_get_priority_max` | 🚧 todo!() | |
| `sched_get_priority_min` | 🚧 todo!() | |
| `sched_getparam` | 🚧 todo!() | |
| `sched_setparam` | 🚧 todo!() | |
| `sched_setscheduler` | 🚧 todo!() | |
| `sched_rr_get_interval` | 🚧 todo!() | |
2.5 IPC Primitives Relevant to Multithreading
From KERNEL-IPC-CREDENTIAL-PLAN.md and direct code review:
| Primitive | Kernel Support | Threading Impact |
|---|---|---|
| Futex | WAIT/WAKE only | Critical — base primitive for all userspace sync |
| Shared memory (shm/mmap MAP_SHARED) | ✅ Via memory scheme | Required for PTHREAD_PROCESS_SHARED |
| Signals (per-thread) | ✅ Via proc: scheme | Thread cancellation, SIGEV_THREAD |
| Pipe (kernel `pipe:` scheme) | ✅ | Thread communication |
| eventfd/signalfd/timerfd | ✅ Recipe-applied | Async I/O notification |
| SysV sem/shm | ✅ Recipe-activated (2026-04-29) | Qt QSystemSemaphore |
| POSIX msg queues | ❌ Missing | Low priority for desktop |
| SysV msg queues | ❌ Missing | Low priority for desktop |
3. Critical Gaps and Blockers
3.1 Priority Gaps (Blocking Desktop Responsiveness)
| # | Gap | Impact | Blocked Consumer |
|---|---|---|---|
| G1 | No SCHED_RR/SCHED_FIFO | All threads treated equally; input/audio threads can't get priority | KWin input thread, PulseAudio |
| G2 | No dynamic priority | CPU-bound threads aren't penalized; I/O-bound threads aren't boosted | Desktop compositor under load |
| G3 | No PI futexes | Priority inversion: low-priority thread holding mutex blocks high-priority waiter | KWin compositor lock, Qt mutexes |
| G4 | No `pthread_setschedparam` | Applications can't request scheduling policy changes | All desktop apps |
| G5 | No timeslice differentiation | High-priority threads get same quantum as low-priority | Poor latency for foreground tasks |
3.2 Scalability Gaps (Blocking Many-Core Performance)
| # | Gap | Impact |
|---|---|---|
| G6 | No work stealing | CPUs go idle while work exists on other CPUs |
| G7 | No load balancing | New threads stay on creator CPU; imbalance builds over time |
| G8 | Global context switch lock | Serialization bottleneck beyond ~8 cores |
| G9 | Global futex mutex | All cores contend on single L1 lock for futex ops |
| G10 | O(n) idle wake scan | Linear scan proportional to total sleeping threads |
| G11 | No NUMA awareness | Cross-node memory access penalty on multi-socket systems |
3.3 Correctness Gaps (Blocking Robust Applications)
| # | Gap | Impact |
|---|---|---|
| G12 | No robust mutexes | Thread death while holding mutex → permanent deadlock |
| G13 | No FUTEX_REQUEUE | Condvar broadcast wakes all waiters → thundering herd |
| G14 | No thread groups (tgid) | kill(pid, sig) can't target a process; getpid() per thread context |
| G15 | Static-only sched_affinity | No userspace CPU pinning API |
| G16 | No setpriority/getpriority | POSIX nice values not wired to kernel priority |
| G17 | pthread barriers hang on SMP | check.sh runs -smp 1 to work around barrier/once hang on multi-core QEMU — blocks KWin GPU barrier sync |
| G18 | pthread_kill race condition | All four pthread_kill tests (child/invalid/self/kill0) are failing — thread-targeted signal delivery unreliable |
| G19 | fork() thread-unsafe | pthread_atfork handlers are empty no-ops; child inherits locked mutexes from parent |
| G20 | Linux aarch64 rlct_clone stub | todo!("rlct_clone not implemented for aarch64 yet") — blocks aarch64 builds |
4. Implementation Plan
Phase S1: Scheduler Observability and Metrics (Week 1-2)
Goal: Add instrumentation to measure and understand scheduling behavior before optimizing.
S1.1 — Per-context scheduling statistics
Add to Context struct:
pub struct Context {
// NEW scheduling statistics:
pub sched_run_count: u64, // Times this context was scheduled
pub sched_wait_time: u128, // Total time spent waiting (accumulated)
pub sched_last_wake: u128, // Timestamp of last unblock
pub sched_migrations: u32, // Times migrated between CPUs
pub sched_preemptions: u32, // Times preempted
pub sched_voluntary_switch: u32, // Times yielded/blocked voluntarily
}
Files: context/context.rs — add fields, initialize in Context::new(), update in switch()
S1.2 — Per-CPU scheduler metrics
Add to cpu_stats.rs:
pub struct CpuStats {
// Existing: user, nice, kernel, idle, irq
// NEW:
pub sched_scans: AtomicU64, // number of select_next_context calls
pub sched_empty_scans: AtomicU64, // scans that found no runnable context
pub sched_steals: AtomicU64, // work stolen from other CPUs (future)
pub sched_ipi_wakeups: AtomicU64, // wakeups via IPI
pub sched_max_queue_depth: AtomicU64, // maximum queue depth observed
}
S1.3 — /scheme/sys/sched debug interface
Expose scheduler metrics via a new kernel scheme path:
- `scheme:sys/sched/runqueues` — per-CPU run queue depths
- `scheme:sys/sched/top` — top-N contexts by recent CPU time
- `scheme:sys/sched/context/{id}` — per-context scheduling stats
This enables redbear-info or a new redbear-sched tool for runtime diagnostics.
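A hedged sketch of how such a diagnostic tool might consume these paths. The plain-text record format is an assumption; S1.3 only defines the path layout:

```rust
use std::fs;

// Hypothetical consumer for the proposed debug scheme paths.
fn dump_runqueues() -> std::io::Result<()> {
    let text = fs::read_to_string("/scheme/sys/sched/runqueues")?;
    for line in text.lines() {
        // Assumed format: one line per CPU with its current queue depth.
        println!("{line}");
    }
    Ok(())
}
```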
S1.4 — relibc sched_getscheduler() baseline
Wire sched_getscheduler() to return SCHED_OTHER (the current DWRR is closest to SCHED_OTHER):
// relibc/src/header/sched/mod.rs
pub extern "C" fn sched_getscheduler(pid: pid_t) -> c_int {
// For now: all processes use SCHED_OTHER (DWRR)
SCHED_OTHER
}
Patch: local/patches/relibc/P5-sched-observe.patch
Phase S2: Real-Time Scheduling Support (Week 2-4)
Goal: Add SCHED_FIFO and SCHED_RR scheduling classes to the kernel, and wire relibc sched_setscheduler().
S2.1 — Scheduling policy in Context
Add to Context:
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SchedPolicy {
Other, // DWRR (current default)
Fifo, // Strict priority, no preemption within same priority
RoundRobin, // Strict priority, round-robin within same priority
// Future:
// Batch, // Throughput-optimized, lower priority than Other
// Idle, // Only runs when absolutely nothing else is runnable
}
pub struct Context {
pub sched_policy: SchedPolicy, // NEW
pub sched_rt_priority: u8, // NEW: 0-99 RT priority
// Renamed: prio → sched_dynamic_prio (for SCHED_OTHER)
pub sched_dynamic_prio: usize,
pub sched_static_prio: usize, // NEW: base priority, unmodified by heuristics
}
Initialization: Default sched_policy = SchedPolicy::Other, sched_rt_priority = 0.
S2.2 — Priority mapping
RT priority 99 → kernel prio 0 (highest)
RT priority 98 → kernel prio 1
...
RT priority 0 → kernel prio 39 (lowest RT, still above SCHED_OTHER)
SCHED_OTHER:
nice -20 → kernel prio 0 (still below RT 0)
nice 0 → kernel prio 20 (default)
nice +19 → kernel prio 39
SCHED_FIFO within same RT priority: no preemption (runs until blocks)
SCHED_RR within same RT priority: round-robin with configurable quantum
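A possible mapping helper that compresses the 100 POSIX RT levels onto the kernel's 40 slots; the linear scaling is an assumption (the listing above shows the intent, not a final formula), and RT-vs-OTHER precedence is enforced by the policy dispatch passes rather than by the numeric priority alone:

```rust
/// Sketch: map POSIX RT priority (0-99) onto kernel priority (0-39).
/// The linear compression factor is an assumption.
fn rt_to_kernel_prio(rt_priority: u8) -> usize {
    let rt = rt_priority.min(99) as usize;
    39 - (rt * 39) / 99 // rt 99 → 0 (highest), rt 0 → 39 (lowest RT level)
}

/// Sketch: map SCHED_OTHER nice values onto kernel priority, per the listing above.
fn nice_to_kernel_prio(nice: i8) -> usize {
    (nice.clamp(-20, 19) as isize + 20) as usize // -20 → 0, 0 → 20, +19 → 39
}
```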
S2.3 — Scheduler dispatch by policy
Modify select_next_context() to prioritize:
1. `SCHED_FIFO` contexts (highest RT priority first, no preemption per priority)
2. `SCHED_RR` contexts (highest RT priority first, round-robin per priority)
3. `SCHED_OTHER` contexts (existing DWRR)
fn select_next_context(...) -> ... {
// PASS 1: SCHED_FIFO — first runnable at highest priority wins
for prio in 0..40 {
if let Some(fifo_ctx) = take_first_runnable_of_policy(
prio, SchedPolicy::Fifo, &mut contexts_list
) {
return Ok(Some(fifo_ctx));
}
}
// PASS 2: SCHED_RR — round-robin within priority
for prio in 0..40 {
if let Some(rr_ctx) = take_next_rr_of_policy(
prio, &mut contexts_list, &mut percpu.rr_position[prio]
) {
return Ok(Some(rr_ctx));
}
}
// PASS 3: SCHED_OTHER — existing DWRR (unchanged)
existing_dwrr_logic(...)
}
S2.4 — SCHED_RR timeslice configuration
Add per-context timeslice for SCHED_RR:
pub struct Context {
pub sched_rr_quantum: u128, // nanoseconds, default 100ms
}
Override the 3-tick quantum for SCHED_RR contexts: track ticks consumed, preempt at quantum.
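A minimal, self-contained sketch of that quantum check in the timer tick path. The `rr_consumed_ns` bookkeeping field and the per-tick constant are illustrative assumptions:

```rust
// Sketch only: fields mirror the Context additions proposed above.
struct RrState { policy_is_rr: bool, rr_consumed_ns: u128, rr_quantum_ns: u128 }

const NS_PER_TICK: u128 = 4_062_000; // ~4.062 ms per PIT tick, as computed earlier

/// Returns true when the running SCHED_RR context should be preempted.
fn tick_should_preempt(s: &mut RrState) -> bool {
    if !s.policy_is_rr {
        return false; // SCHED_OTHER keeps the existing 3-tick quantum
    }
    s.rr_consumed_ns += NS_PER_TICK;
    if s.rr_consumed_ns >= s.rr_quantum_ns {
        s.rr_consumed_ns = 0; // reset for the next time it is scheduled
        true // quantum exhausted: requeue at the back of its RR priority level
    } else {
        false
    }
}
```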
S2.5 — syscall interface for policy changes
Add kernel syscall or extend proc: scheme:
proc: scheme command: SetSchedPolicy(pid, policy, rt_priority)
S2.6 — Wire relibc sched_setscheduler()
// relibc/src/header/sched/mod.rs
pub extern "C" fn sched_setscheduler(
pid: pid_t, policy: c_int, param: *const sched_param,
) -> c_int {
let prio = unsafe { (*param).sched_priority };
let kernel_policy = match policy {
SCHED_FIFO => SchedPolicyRequest::Fifo,
SCHED_RR => SchedPolicyRequest::RoundRobin,
SCHED_OTHER => SchedPolicyRequest::Other,
_ => return set_errno(EINVAL),
};
// Send to kernel via proc: scheme
Sys::set_sched_policy(pid, kernel_policy, prio)
}
Patches:
- `local/patches/kernel/P5-sched-policy.patch` — Context fields + sched dispatch
- `local/patches/kernel/P5-sched-policy-proc.patch` — proc: scheme SetSchedPolicy
- `local/patches/relibc/P5-sched-setscheduler.patch` — wire through scheme
- `local/patches/relibc/P5-sched-getscheduler.patch` — return current policy
- `local/patches/relibc/P5-sched-priority.patch` — sched_get/setparam
Phase S3: Load Balancing and Work Stealing (Week 4-6)
Status: ✅ COMPLETE (2026-04-30) — P3.1 PerCpuSched struct + P3.2 per-CPU wiring + P3.3 work stealing + P3.4 initial placement (least-loaded CPU) + P3.5 periodic load balancing all implemented.
Goal: Distribute runnable contexts across CPUs to maximize utilization.
S3.1 — Per-CPU run queue lock elimination
Replace the global RUN_CONTEXTS: Mutex<L1, RunContextData> with per-CPU run queues:
// In PercpuBlock:
pub struct PerCpuSched {
pub run_queues: [VecDeque<WeakContextRef>; 40],
pub run_queues_lock: SpinLock, // per-CPU, low contention
pub balance: [usize; 40],
pub last_queue: usize,
pub idle_context: Arc<ContextLock>,
}
This eliminates the global L1 mutex bottleneck for dequeue operations.
S3.2 — Idle CPU work stealing
When select_next_context() finds no runnable context on the local CPU:
- Pick a victim CPU (round-robin or random)
- Lock victim's run queues
- Dequeue the highest-priority runnable context
- Return it for scheduling
fn steal_work(percpu: &PercpuBlock, cpu_id: LogicalCpuId) -> Option<ArcContextLockWriteGuard> {
for victim_offset in 1..cpu_count() {
let victim_id = (cpu_id + victim_offset) % cpu_count();
let victim_percpu = percpu_for(victim_id);
// Try to steal from highest priority queues first
for prio in 0..40 {
if let Some(ctx) = victim_percpu.dequeue_runnable(prio) {
percpu.stats.sched_steals.fetch_add(1, Ordering::Relaxed);
return Some(ctx);
}
}
}
None
}
S3.3 — Initial placement (fork/exec balance)
When creating a new context, instead of always going to the creating CPU's idle queue:
fn place_new_context(ctx: &mut Context) -> LogicalCpuId {
// Pick the CPU with the shortest total run queue
let target = cpus()
.min_by_key(|cpu| cpu.total_runnable_contexts())
.unwrap_or(crate::cpu_id());
ctx.sched_affinity = LogicalCpuSet::single(target);
target
}
S3.4 — Periodic load balancing
Add a periodic balancing trigger (e.g., every 100ms or when queue depth difference exceeds threshold):
fn balance_load() {
    let avg_depth = average_runnable_per_cpu();
    // Treat a CPU as overloaded once it exceeds ~125% of the average depth.
    let threshold = avg_depth + avg_depth / 4;
    for cpu in overloaded_cpus(threshold) {
        let target = most_idle_cpu();
        migrate_contexts(cpu, target, cpu.total_runnable().saturating_sub(avg_depth));
    }
}
Patches:
- `local/patches/kernel/P6-percpu-runqueues.patch` — per-CPU run queues (infrastructure)
Phase S4: Futex Enhancements (Week 6-9)
Status: ✅ COMPLETE (2026-04-30) — S4.1 futex sharding (64-shard), S4.2 FUTEX_REQUEUE, S4.3 PI futex, S4.4 robust futex, vruntime tracking, minimum-vruntime selection all implemented.
Goal: Add PI, requeue, and per-futex locking to support robust desktop mutex performance.
S4.1 — Per-futex locking (reduce global contention)
Replace the single FUTEXES: Mutex<L1, FutexList> with a sharded hash table:
const FUTEX_SHARDS: usize = 64; // or scale with CPU count
static FUTEXES: [Mutex<FutexList>; FUTEX_SHARDS] = ...;
fn futex_shard(phys: PhysicalAddress) -> usize {
phys.data() as usize % FUTEX_SHARDS
}
S4.2 — FUTEX_REQUEUE and FUTEX_CMP_REQUEUE
fn futex_requeue(
addr1: PhysicalAddress, // source futex
addr2: PhysicalAddress, // target futex
val: usize, // max to requeue
val2: usize, // expected value (for CMP_REQUEUE)
cmp: bool, // whether to compare first
) -> Result<usize> {
// Atomically move up to `val` waiters from addr1's wait queue to addr2's
// If cmp is true, only proceed if *addr1 == val2
}
This is critical for condition variable performance — without it, pthread_cond_broadcast causes a thundering herd where every waiter wakes, rechecks, and most re-block.
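For illustration, a broadcast built on CMP_REQUEUE wakes a single waiter and transplants the rest onto the mutex word, so they are released one at a time as the mutex is unlocked. The futex wrapper names and the `Cond`/`Mutex` layouts are hypothetical:

```rust
use core::sync::atomic::{AtomicU32, Ordering};

// Hypothetical wrappers over the futex syscall; names are illustrative.
fn futex_wake(word: &AtomicU32, count: u32) -> usize { 0 }
fn futex_cmp_requeue(
    from: &AtomicU32, to: &AtomicU32, wake: u32, requeue: u32, expected: u32,
) -> Result<usize, i32> { Ok(0) }

struct Cond { seq: AtomicU32 }
struct Mutex { word: AtomicU32 }

impl Cond {
    /// Broadcast without a thundering herd: wake one waiter, requeue the rest
    /// onto the mutex so they are woken one unlock at a time.
    fn broadcast(&self, mutex: &Mutex) {
        let seq = self.seq.load(Ordering::Relaxed);
        let _ = futex_cmp_requeue(&self.seq, &mutex.word, 1, u32::MAX, seq);
    }

    /// Fallback when only FUTEX_WAKE exists: every waiter wakes, rechecks the
    /// predicate, and most immediately block again on the mutex (the herd).
    fn broadcast_without_requeue(&self) {
        futex_wake(&self.seq, u32::MAX);
    }
}
```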
S4.3 — PI Futexes (FUTEX_LOCK_PI / FUTEX_UNLOCK_PI / FUTEX_TRYLOCK_PI / FUTEX_CMP_REQUEUE_PI)
Priority inheritance for futexes:
pub struct PiState {
owner: Option<Arc<ContextLock>>,
waiters: Vec<(Arc<ContextLock>, u32)>, // (context, original_priority)
}
// When a high-priority context blocks on a PI futex held by a low-priority context:
fn pi_boost(owner: &mut Context, waiter_prio: usize) {
if waiter_prio < owner.sched_dynamic_prio {
owner.sched_dynamic_prio = waiter_prio;
owner.pi_boosted = true;
}
}
Critical path: KWin compositor lock. Without PI, a low-priority background thread holding a mutex that the compositor thread needs can block rendering for an unbounded time.
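The boost must also be undone when the owner releases the futex. A sketch of the complementary path, assuming the base priority is preserved in `sched_static_prio` as proposed in S2.1:

```rust
// Sketch: restore the owner's priority on FUTEX_UNLOCK_PI.
fn pi_unboost(owner: &mut Context) {
    if owner.pi_boosted {
        // If the owner still holds other PI futexes with higher-priority
        // waiters, it should re-inherit the highest of those instead of
        // dropping straight to its static priority (not shown here).
        owner.sched_dynamic_prio = owner.sched_static_prio;
        owner.pi_boosted = false;
    }
}
```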
S4.4 — Robust Futexes
Mark futex waiters in a robust_list so the kernel can unlock them on thread death:
pub struct RobustListEntry {
futex_addr: usize,
futex_len: usize,
// List is per-thread, registered via set_robust_list syscall
}
On exit_thread():
fn wake_robust_futexes(context: &Context) {
for entry in &context.robust_list {
// Set FUTEX_OWNER_DIED bit
// Wake one waiter with EOWNERDEAD
}
}
Patches:
- `local/patches/kernel/P6-futex-sharding.patch` — futex lock sharding (delivered)
- (PI futex, requeue, robust futex deferred)
Phase S5: Dynamic Priority and Thread Management (Week 9-11)
Status: ✅ COMPLETE (2026-04-30) — S5.1 vruntime + S5.2 setpriority/getpriority + S5.3 pthread_setaffinity_np + S5.4 pthread_setname_np + pthread_setschedparam (Redox) all implemented.
Goal: Add I/O-vs-CPU heuristics, CPU affinity API, and thread naming.
S5.1 — Dynamic priority adjustment (SCHED_OTHER)
Implement a simplified CFS-style virtual runtime tracking:
pub struct Context {
pub vruntime: u128, // Virtual runtime (weighted by priority)
}
// On context switch OUT:
prev_context.vruntime += actual_runtime * SCHED_PRIO_TO_WEIGHT[default_prio]
/ SCHED_PRIO_TO_WEIGHT[prev_context.sched_static_prio];
// On select_next_context for SCHED_OTHER:
// Pick context with lowest vruntime instead of DWRR deficit tracking
This automatically penalizes CPU-bound threads (their vruntime grows faster) and favors I/O-bound threads (they sleep, vruntime stays low).
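A worked example of the weighted update above, for two threads at the default static priority (weight ratio 1024/1024 = 1):

```rust
fn main() {
    let (mut vr_cpu_bound, mut vr_io_bound) = (0u128, 0u128);

    // The CPU-bound thread burns a full ~12 ms quantum; the I/O-bound thread
    // runs 1 ms and then blocks on a read.
    vr_cpu_bound += 12_000_000; // ns of virtual runtime accrued
    vr_io_bound += 1_000_000;

    // Next SCHED_OTHER pick: lowest vruntime wins, so the I/O-bound thread is
    // chosen as soon as it becomes runnable again.
    assert!(vr_io_bound < vr_cpu_bound);
}
```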
S5.2 — POSIX nice values
Map nice(-20..+19) to static priorities:
fn nice_to_static_prio(nice: i8) -> usize {
// nice -20 → kernel prio 0 (SCHED_OTHER range)
// nice 0 → kernel prio 20
// nice +19 → kernel prio 39
((nice + 20) as usize).clamp(0, 39)
}
// Wire setpriority/getpriority to modify sched_static_prio
S5.3 — CPU affinity API
Add to proc: scheme:
proc: scheme command: SetAffinity(pid, affinity_mask: u64)
proc: scheme command: GetAffinity(pid) → u64
Wire in relibc:
pub extern "C" fn pthread_setaffinity_np(
thread: pthread_t, cpusetsize: size_t, cpuset: *const cpu_set_t,
) -> c_int {
let mask = unsafe { read_cpu_set(cpuset, cpusetsize) };
Sys::set_cpu_affinity(tid, mask)
}
S5.4 — Thread naming API
The kernel Context.name field already exists (32-char ArrayString). Wire it:
// proc: scheme command: SetName(pid, name)
// relibc:
pub extern "C" fn pthread_setname_np(thread: pthread_t, name: *const c_char) -> c_int {
let name = unsafe { CStr::from_ptr(name) };
Sys::set_thread_name(thread.os_tid, name)
}
Patches:
- `local/patches/kernel/P6-vruntime-context.patch` — vruntime field + initialization
- `local/patches/kernel/P6-vruntime-switch.patch` — weighted update + min-vruntime selection
- `local/patches/kernel/P7-cache-affine-context.patch` — cache-affine scheduling (last_cpu)
- `local/patches/kernel/P7-cache-affine-switch.patch` — cache-affine vruntime bonus
- `local/patches/kernel/P7-proc-setpriority.patch` — setpriority proc handle
- `local/patches/kernel/P7-proc-setname.patch` — thread naming proc handle
- `local/patches/relibc/P7-setpriority.patch` — setpriority/getpriority
- `local/patches/relibc/P7-pthread-affinity.patch` — pthread_setaffinity_np
- `local/patches/relibc/P7-pthread-setname.patch` — pthread_setname_np
Phase S6: NUMA and Cache-Affine Scheduling (Week 11-13)
Status: ✅ DELIVERED (2026-04-30) — S6.3 cache-affine scheduling + S6.1 NUMA topology kernel hints implemented. NUMA discovery (SRAT/SLIT parsing) is userspace responsibility (numad daemon via /scheme/acpi/). Kernel stores lightweight NumaTopology for O(1) scheduling lookups. Full userspace numad daemon is follow-up work.
Goal: Optimize for multi-socket systems by keeping related threads near their memory.
S6.1 — NUMA topology discovery
Parse ACPI SRAT/SLIT tables (already available in ACPI infrastructure):
pub struct NumaTopology {
nodes: Vec<NumaNode>,
distances: Vec<Vec<u8>>, // SLIT inter-node distances
}
pub struct NumaNode {
id: u8,
cpus: LogicalCpuSet,
memory: PhysicalMemoryRange,
}
S6.2 — NUMA-aware initial placement
When creating a new context:
- If the parent thread has a `sched_affinity`, prefer CPUs in the same NUMA node (see the sketch below)
- Otherwise, pick the NUMA node with the most free memory
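A sketch of that placement rule. `NumaTopology` mirrors the S6.1 struct, while `node_of_cpu` and `free_memory_bytes` are illustrative helpers, not existing kernel APIs:

```rust
// Sketch only: pick a NUMA node for a newly created context.
fn pick_numa_node(topo: &NumaTopology, parent_cpu: Option<LogicalCpuId>) -> u8 {
    if let Some(cpu) = parent_cpu {
        // Keep the child near the parent's memory.
        return topo.node_of_cpu(cpu);
    }
    // Otherwise spread by free memory across nodes.
    topo.nodes
        .iter()
        .max_by_key(|node| node.free_memory_bytes())
        .map(|node| node.id)
        .unwrap_or(0)
}
```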
S6.3 — Cache-affine scheduling
Track the last CPU a context ran on. Prefer to re-schedule on the same CPU to avoid cache migration penalty:
pub struct Context {
pub sched_last_cpu: LogicalCpuId, // already tracked via cpu_id before it becomes None
}
In select_next_context():
// When scanning runnable contexts, prefer those whose last_cpu == current_cpu_id
// (hot cache) over those from other CPUs (cold cache)
let hot_ctx = search_for_hot_context(current_cpu, &queues);
let fallback = search_for_cold_context(&queues);
hot_ctx.or(fallback)
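The delivered patches describe this preference as a "vruntime bonus"; one way to express it is to make a hot-cache candidate look slightly cheaper when comparing vruntimes. The 1 ms bonus below is an illustrative value, not the patched constant:

```rust
// Sketch: bias SCHED_OTHER selection toward contexts whose last CPU is the
// current CPU, so ties break toward warm caches.
const CACHE_AFFINE_BONUS_NS: u128 = 1_000_000; // assumed value for illustration

fn effective_vruntime(vruntime: u128, last_cpu: LogicalCpuId, current: LogicalCpuId) -> u128 {
    if last_cpu == current {
        vruntime.saturating_sub(CACHE_AFFINE_BONUS_NS)
    } else {
        vruntime
    }
}
```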
Patches:
- `local/patches/kernel/P7-cache-affine-context.patch` — cache-affine scheduling (delivered)
- `local/patches/kernel/P7-cache-affine-switch.patch` — cache-affine vruntime bonus (delivered)
- (NUMA SRAT/SLIT parsing deferred)
Phase R1: relibc POSIX Scheduling API Completion (Week 2-4, parallel with S2)
Goal: Fill all todo!() stubs in sched.h and pthread.h scheduling functions.
| Function | Implementation |
|---|---|
| `sched_get_priority_max(policy)` | Return 99 for FIFO/RR, 0 for OTHER |
| `sched_get_priority_min(policy)` | Return 1 for FIFO/RR, 0 for OTHER |
| `sched_getparam(pid, param)` | Query kernel for current RT priority |
| `sched_setparam(pid, param)` | Delegate to sched_setscheduler with current policy |
| `sched_getscheduler(pid)` | Query kernel for current policy |
| `sched_rr_get_interval(pid, tp)` | Return SCHED_RR quantum (default 100ms) |
| `pthread_setschedparam(thread, policy, param)` | Set kernel sched policy via proc: scheme |
| `pthread_getschedparam(thread, policy, param)` | Get kernel sched policy |
| `pthread_setschedprio(thread, prio)` | Set dynamic priority within current policy |
| `pthread_getcpuclockid(thread, clock_id)` | Return CPU-time clock for thread |
Patches: All in local/patches/relibc/P5-sched-complete.patch
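A minimal sketch of the two priority-range stubs, following the values in the table above; the error path reuses the `set_errno` convention from the earlier `sched_setscheduler` sketch:

```rust
// relibc/src/header/sched/mod.rs — sketch only.
pub extern "C" fn sched_get_priority_max(policy: c_int) -> c_int {
    match policy {
        SCHED_FIFO | SCHED_RR => 99,
        SCHED_OTHER => 0,
        _ => return set_errno(EINVAL), // mirrors the sched_setscheduler sketch above
    }
}

pub extern "C" fn sched_get_priority_min(policy: c_int) -> c_int {
    match policy {
        SCHED_FIFO | SCHED_RR => 1,
        SCHED_OTHER => 0,
        _ => return set_errno(EINVAL),
    }
}
```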
Phase R2: Robust and PI Mutex Support (Week 5-9, parallel with S4)
Goal: Full POSIX mutex robustness and priority inheritance.
R2.1 — PI mutex protocol
// relibc/src/sync/pthread_mutex.rs
pub struct PthreadMutex {
futex: AtomicU32,
owner: AtomicUsize, // os_tid of current owner
pi_waiters: Mutex<Vec<(OsTid, u32)>>, // waiters with requested priority
flags: AtomicU32, // PTHREAD_PRIO_INHERIT, PTHREAD_MUTEX_ROBUST
}
impl PthreadMutex {
    // Lock with PI: the kernel assigns ownership and boosts the holder's
    // priority when a higher-priority waiter arrives (FUTEX_LOCK_PI).
    fn lock_pi(&self) -> Result<(), Errno> {
        loop {
            match futex::lock_pi(&self.futex) {
                Ok(()) => {
                    self.owner.store(current_tid(), Ordering::Release);
                    return Ok(());
                }
                Err(EAGAIN) => continue, // owner changed mid-operation; retry
                Err(err) => return Err(err),
            }
        }
    }
}
R2.2 — Robust mutex protocol
pub struct RobustList {
head: *mut RobustListHead,
}
pub struct RobustListHead {
list: RobustList,
futex_offset: isize,
pending: *mut RobustListHead,
}
// On thread exit:
fn handle_robust_list(thread: &Pthread) {
    for entry in thread.robust_list.iter() {
        // The futex word sits at a fixed signed offset from the list entry.
        let futex_addr =
            (entry as *const _ as usize).wrapping_add_signed(entry.futex_offset) as *mut AtomicU32;
        unsafe {
            // Set FUTEX_OWNER_DIED so the next locker observes EOWNERDEAD
            (*futex_addr).fetch_or(FUTEX_OWNER_DIED, Ordering::Release);
        }
        // Wake one waiter; it must then call pthread_mutex_consistent
        futex::wake(futex_addr, 1);
    }
}
Phase R3: Thread Groups and Process Identity (Week 10-12)
Goal: Proper tgid/pid distinction, kill(pid, 0) process targeting.
R3.1 — Kernel thread group concept
pub struct Context {
pub tgid: usize, // Thread Group ID (= pid for main thread)
pub tid: usize, // Thread ID (unique per thread)
}
- On `clone(CLONE_THREAD)`: child gets the same tgid as the parent, new tid
- On `fork()`: child gets a new tgid equal to the child's tid
- `getpid()` returns tgid; `gettid()` returns tid
- `kill(tgid, sig)` delivers a signal to all threads in the thread group
R3.2 — Thread group signal delivery
fn deliver_signal_to_thread_group(tgid: usize, sig: Signal) {
for context in contexts_in_thread_group(tgid) {
// Pick a thread that hasn't blocked this signal
if !context.sig_blocked(sig) {
context.deliver_signal(sig);
break;
}
}
}
Patches:
- `local/patches/kernel/P5-tgid.patch` — thread group ID kernel support
- `local/patches/kernel/P5-tgid-signal.patch` — process-targeted signal delivery
- `local/patches/relibc/P5-gettid.patch` — gettid() syscall
5. Dependency Chain
Phase S1 (observability)
│
├──► Phase S2 (real-time scheduling) ────┐
│ │ │
│ ├──► Phase R1 (POSIX sched API) │
│ │ │
│ └──► KWin input thread priority │
│ │
├──► Phase S3 (load balancing) ───────────┤
│ │ │
│ └──► Mesa worker thread scaling │
│ │
├──► Phase S4 (futex enhancements) ───────┤
│ │ │
│ ├──► Phase R2 (PI/robust mutex) │
│ │ │
│ └──► KWin compositor lock │
│ │
├──► Phase S5 (dynamic prio + affinity) ──┤
│ │ │
│ └──► Application CPU pinning │
│ │
├──► Phase R3 (thread groups) ────────────┤
│ │ │
│ └──► process-targeted signals │
│ │
└──► Phase S6 (NUMA) ─────────────────────┘
│
└──► Multi-socket server performance
Independent work (can run in parallel):
- S2 (RT scheduling) + R1 (POSIX sched API) — parallel
- S4 (futex) + R2 (PI/robust mutex) — parallel
- S3 (load balancing) can start after S1 but independently of S2
- S6 (NUMA) depends on S3 (per-CPU queues) but not on S4/S5
6. Integration with Existing Plans
| Existing Plan | Relationship |
|---|---|
| KERNEL-IPC-CREDENTIAL-PLAN.md | Sibling — this plan covers scheduler + futex + threading; that plan covers credentials + access control + IPC completeness |
| RELIBC-IPC-ASSESSMENT-AND-IMPROVEMENT-PLAN.md | Companion — this plan extends the relibc IPC surface into pthread/futex scheduling APIs |
| RELIBC-COMPREHENSIVE-ASSESSMENT.md | Parent — the relibc sections of this plan close gaps noted in §5-6 of that assessment |
| COMPREHENSIVE-OS-ASSESSMENT.md | Parent — this plan closes §2 kernel gaps for scheduler/scalability |
| CONSOLE-TO-KDE-DESKTOP-PLAN.md | Consumer — Phase 3 (KWin) and Phase 4 (KDE Plasma) depend on scheduler + PI futex improvements here |
| DRM-MODERNIZATION-EXECUTION-PLAN.md | Sibling — GPU worker thread scheduling benefits from load balancing (S3) |
| IRQ-AND-LOWLEVEL-CONTROLLERS-ENHANCEMENT-PLAN.md | Sibling — IRQ latency affects scheduling latency |
7. Patch Governance
All kernel and relibc source changes follow the durability policy (local/AGENTS.md):
local/patches/
├── kernel/
│ (Delivered: P6-* and P7-* patches below. P5-sched-* entries are planned future carriers.)
│ ├── P5-sched-observability.patch # S1
│ ├── P5-sched-policy.patch # S2
│ ├── P5-sched-policy-proc.patch # S2 proc: scheme
│ ├── P6-percpu-runqueues.patch # S3 (delivered: infrastructure)
│ ├── P6-futex-sharding.patch # S4 (delivered: sharding)
│ ├── P6-vruntime-context.patch # S5 (delivered: field + init)
│ ├── P6-vruntime-switch.patch # S5 (delivered: update + selection)
│ ├── (remaining S3-S6 patches deferred)
├── relibc/
│ ├── P5-sched-observe.patch # R1 baseline
│ ├── P5-sched-setscheduler.patch # R1
│ ├── P5-sched-getscheduler.patch # R1
│ ├── P5-sched-priority.patch # R1
│ ├── P5-sched-complete.patch # R1 remaining stubs
│ ├── (PI/robust mutex deferred) # R2
│ ├── P7-setpriority.patch # S5 (delivered)
│ ├── P7-pthread-affinity.patch # S5 (delivered)
│ └── P5-gettid.patch # R3
8. Validation and Evidence
8.1 Build Evidence
| Check | Command |
|---|---|
| Kernel compiles | make r.kernel |
| relibc compiles | make r.relibc |
| Full OS builds | make all CONFIG_NAME=redbear-full |
8.2 Runtime Evidence
| Test | Verification |
|---|---|
| `sched_getscheduler()` returns policy | `redbear-info --sched` |
| `pthread_setschedparam()` changes priority | Threaded test binary: `test-sched-priority` |
| RT thread preempts SCHED_OTHER | Latency test: RT thread wakes within 100μs |
| Work stealing across CPUs | redbear-info --sched shows balanced queue depths |
| PI futex prevents priority inversion | PI test: low-prio holder, high-prio waiter, medium-prio contester |
| Robust mutex recovery after thread kill | Robust test: kill thread holding mutex, verify EOWNERDEAD |
| Thread affinity pinning | taskset-like test: verify thread stays on assigned CPU |
| Load balancing on fork bomb | Spawn 2× CPUs threads, verify even distribution |
8.3 Verification Scripts
local/scripts/test-sched-qemu.sh # Scheduler metric validation
local/scripts/test-sched-rt-qemu.sh # Real-time scheduling proof
local/scripts/test-futex-pi-qemu.sh # PI futex proof
local/scripts/test-futex-robust-qemu.sh # Robust futex proof
local/scripts/test-sched-balance-qemu.sh # Load balancing proof (multi-vCPU)
9. Bottom Line
The Redox kernel scheduler is functional but simple — a correct DWRR implementation that works for a lightly-loaded system. For the KDE/Wayland desktop with dozens of competing threads (compositor, rendering, I/O, timers, D-Bus, input), it needs:
- Real-time scheduling (S2) — for audio and compositor input threads
- PI futexes (S4/R2) — to prevent the compositor lock from being inverted by background work
- Load balancing (S3) — to use all available cores efficiently
- Dynamic priority (S5) — to keep the compositor responsive under CPU load
These four items are the critical path to a responsive desktop. The remaining items (NUMA, thread groups, robust mutexes, affinity API) are important for correctness and server-class workloads but not desktop-blocking.
Total estimated effort: 13 weeks with 1-2 kernel developers, delivering incremental improvements at each phase boundary.