RedBear-OS/local/docs/archived/KERNEL-SCHEDULER-MULTITHREAD-IMPROVEMENT-PLAN.md
vasilito 13ac42b218 docs: final stale doc cleanup — 22 archived, 18 active
Archived: IOMMU-SPEC, KERNEL-IPC, KERNEL-SCHEDULER, PROFILE-MATRIX,
QUIRKS-IMPROVEMENT, RELIBC-IPC, repo-governance, SCHEDULER-REVIEW,
SCRIPT-BEHAVIOR, USB-VALIDATION, XHCID-DEVICE-IMPROVEMENT.

Active: all implementation plans + 3 audits + governance docs.
2026-05-03 16:26:13 +01:00

# Red Bear OS — Kernel Scheduler, Multithreading, and IPC Performance Improvement Plan
**Date:** 2026-04-30
**Scope:** Kernel scheduler optimization, futex enhancements, multithreaded performance, relibc POSIX threading completeness
**Status:** S3 complete (per-CPU + stealing + balancing + placement), S4 complete (futex sharding + REQUEUE + PI + robust + vruntime), S5 complete (setpriority + affinity + naming + schedparam), S6 partial (cache-affine delivered, NUMA deferred). This is the **canonical scheduler + multithreading authority**, extending `KERNEL-IPC-CREDENTIAL-PLAN.md` and `RELIBC-IPC-ASSESSMENT-AND-IMPROVEMENT-PLAN.md`
---
## 1. Executive Summary
The Redox microkernel currently uses a **Deficit Weighted Round Robin (DWRR)** scheduler with 40 static priority levels, per-CPU run queues, and cooperative preemption. The relibc C library provides a largely complete pthreads implementation, but POSIX scheduling APIs (`sched_*`, `pthread_setschedparam`) are stubbed out. For the KDE/Wayland desktop path, multithreaded performance bottlenecks in the scheduler and futex subsystem will become the dominant limitation once the compositor (KWin) and GPU rendering pipelines are active.
### Current State at a Glance
| Area | Status | Key Gaps |
|------|--------|----------|
| Kernel scheduler | DWRR, 40 levels, vruntime selection for SCHED_OTHER, RT pass for FIFO/RR | Per-CPU run queues are infrastructure only; load balancing deferred |
| Futex | WAIT/WAIT64/WAKE + 64-shard hash table | No PI, no requeue, no robust futex |
| relibc pthreads | Create/join/detach/mutex/cond/rwlock/barrier/spin/tls | `sched_*` all `todo!()`, no PI/robust mutexes, no affinity API |
| Thread management | proc: scheme clone/fork/exec | No dynamic priority, no CPU affinity from userspace, no thread groups |
| IPC for threading | Futex, shared memory, signals | No process-shared robust/PI mutexes, no adaptive spinning |
### Why This Matters for the Desktop Path
```
KWin compositor (Qt6/QPA/Wayland)
└── Worker threads: rendering, input, effects
    ├── Requires: efficient futex wakeups, PI for compositor lock
    └── Requires: SCHED_RR for input thread priority

Mesa GPU driver (LLVMpipe or hardware)
└── Gallium worker threads: shader compilation, draw submission
    ├── Requires: load-balanced scheduling across all CPUs
    └── Requires: non-contended futex performance

Qt6 event loop
└── Thread pool for QFuture/QtConcurrent
    ├── Requires: SCHED_OTHER fair scheduling under load
    └── Requires: proper pthread_attr_setschedparam
```
---
## 2. Current Architecture Assessment
### 2.1 Scheduler Architecture
**File:** `recipes/core/kernel/source/src/context/switch.rs`
**Algorithm:** Deficit Weighted Round Robin (DWRR) — documented at line 354:
```rust
/// This is the scheduler function which currently utilises Deficit Weighted Round Robin Scheduler
fn select_next_context(...)
```
**Key data structures** (from `context/mod.rs`):
```rust
// 40 priority levels, each with its own queue
pub struct RunContextData {
set: [VecDeque<WeakContextRef>; 40],
}
// Global lock for run queues (L1 = highest-level lock)
static RUN_CONTEXTS: Mutex<L1, RunContextData> = ...;
// Idle/sleeping contexts — scanned linearly on every tick
static IDLE_CONTEXTS: Mutex<L2, VecDeque<WeakContextRef>> = ...;
// All contexts (for enumeration)
static CONTEXTS: RwLock<L2, BTreeSet<ContextRef>> = ...;
```
**Priority weights** (geometric decay ~1.25x per level):
```rust
const SCHED_PRIO_TO_WEIGHT: [usize; 40] = [
    88761, 71755, 56483, 46273, 36291, 29154, 23254, 18705, 14949, 11916,
    9548, 7620, 6100, 4904, 3906, 3121, 2501, 1991, 1586, 1277,
    1024, 820, 655, 526, 423, 335, 272, 215, 172, 137,
    110, 87, 70, 56, 45, 36, 29, 23, 18, 15,
];
```
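The "~1.25x geometric decay" claim can be checked mechanically against the table. A small sketch (pure arithmetic, no kernel types):

```rust
// The weight table from the document; adjacent entries should decay
// by roughly 1.25x from one priority level to the next.
const SCHED_PRIO_TO_WEIGHT: [usize; 40] = [
    88761, 71755, 56483, 46273, 36291, 29154, 23254, 18705, 14949, 11916,
    9548, 7620, 6100, 4904, 3906, 3121, 2501, 1991, 1586, 1277,
    1024, 820, 655, 526, 423, 335, 272, 215, 172, 137,
    110, 87, 70, 56, 45, 36, 29, 23, 18, 15,
];

// Ratio between neighbouring weights, as a float.
fn decay_ratio(i: usize) -> f64 {
    SCHED_PRIO_TO_WEIGHT[i] as f64 / SCHED_PRIO_TO_WEIGHT[i + 1] as f64
}
```

In practice the ratios range from 1.20 to about 1.28, with the default level (index 20) anchored at weight 1024, the same convention as Linux's nice-to-weight table.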
**Time quantum:** 3 PIT ticks per context (~12.2ms). PIT channel 0 has divisor 4847 at 1.193182 MHz → ~4.062ms per tick. 3 ticks → ~12.2ms between context switches. The 6.75ms in the tick() comment is outdated.
**Default priority:** 20 (middle of range).
**Max scheduler iterations:** 5000 per `select_next_context` call (bail-out limit).
**Per-CPU state:** `percpu.balance: [usize; 40]` (deficit counters), `percpu.last_queue` (round-robin position).
**Preemption:** Preemptible unless `context.preempt_locks > 0` (guarded by `PreemptGuard` RAII wrappers).
**Context switch lock:** Global `arch::CONTEXT_SWITCH_LOCK` — spinlock with `compare_exchange_weak` on `Ordering::SeqCst`.
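The quantum arithmetic above can be verified with a few lines (plain math, no kernel types; the divisor and base frequency are those stated in the text):

```rust
// PIT channel 0 base frequency in Hz.
const PIT_HZ: f64 = 1_193_182.0;

// Milliseconds per PIT tick for a given divisor.
fn pit_tick_ms(divisor: u32) -> f64 {
    divisor as f64 / PIT_HZ * 1000.0
}

// Scheduler quantum in milliseconds: ticks per context * tick period.
fn quantum_ms(divisor: u32, ticks: u32) -> f64 {
    pit_tick_ms(divisor) * ticks as f64
}
```

Divisor 4847 gives ~4.062 ms per tick, and 3 ticks give ~12.19 ms, matching the figures above.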
**Current limitations:**
1. **No real-time scheduling wired to userspace** — kernel has SchedPolicy enum and RT scheduling pass, but relibc `sched_setscheduler` returns ENOSYS for FIFO/RR until kernel wire-up is complete.
2. **No dynamic priority adjustment** — `context.prio` is set once and never changes. vruntime-based fairness compensates for SCHED_OTHER, but there is no nice-value decay/boost.
3. **No work stealing** — each CPU only dequeues from its own queues. A CPU can go idle while another has backlog.
4. **No load balancing** — newly created contexts go to the creating CPU's idle queue. No migration across CPUs.
5. **O(n) idle wakeup scan** — `wakeup_contexts()` linearly scans the entire `IDLE_CONTEXTS` VecDeque on every tick (every ~2.25ms effective).
6. **Single global context switch lock** — `arch::CONTEXT_SWITCH_LOCK` serializes all CPU context switches on many-core systems.
7. **No NUMA awareness** — memory locality is not considered during scheduling.
8. **No timeslice scaling** — all contexts get the same 3-tick quantum regardless of priority (priority only affects how often they're picked, not how long they run).
9. **Large fixed iteration limit** — 5000 iterations per schedule attempt can cause latency spikes under heavy load.
### 2.2 Context/Thread Model
**File:** `recipes/core/kernel/source/src/context/context.rs`
```rust
pub struct Context {
    pub prio: usize,                   // Priority (0-39, default 20)
    pub status: Status,                // Runnable / Blocked / HardBlocked / Dead
    pub running: bool,                 // Currently on a CPU
    pub cpu_id: Option<LogicalCpuId>,  // Which CPU this context is on
    pub sched_affinity: LogicalCpuSet, // Allowed CPU set
    pub cpu_time: u128,                // Accumulated CPU time (nanoseconds)
    pub switch_time: u128,             // Last switch-in time
    pub wake: Option<u128>,            // Wake timestamp for timed sleeps
    pub preempt_locks: usize,          // Preemption disable counter
    pub kfx: AlignedBox,               // SIMD/FPU save area
    pub addr_space: Option<Arc<AddrSpaceWrapper>>, // Can be shared (threads)
    pub files: Arc<LockedFdTbl>,       // Can be shared (same process threads)
    pub owner_proc_id: Option<NonZeroUsize>, // Parent process
    pub name: ArrayString<32>,         // Human-readable name
    // Credentials:
    pub euid: u32, pub egid: u32, pub pid: usize,
    pub groups: Vec<u32>,              // Supplementary groups
}
```
**Thread creation flow:**
```
pthread_create()
  → relibc::pthread::create()
      → mmap() for stack
      → Tcb::new() for TLS
      → stack setup with entry shim
      → Sys::rlct_clone(stack, os_specific)
          → redox_rt::clone()
              → proc: scheme -> kernel clone
                  → Context::new() (same owner_proc_id, shared addr_space)
                  → context::spawn() (pushed to IDLE_CONTEXTS)
```
**Key architectural points:**
- Threads share the same `addr_space: Arc<AddrSpaceWrapper>` (same page tables)
- Threads share `files: Arc<LockedFdTbl>` (same FD table)
- Thread ownership via `owner_proc_id` — but no formal thread group concept
- No distinction between process and thread at kernel level — all are Contexts
- `pid` is set once, no `tgid`/`tid` distinction
### 2.3 Futex Implementation
**File:** `recipes/core/kernel/source/src/syscall/futex.rs`
```rust
// Global hash table: PhysicalAddress → Vec<FutexEntry>
type FutexList = HashMap<PhysicalAddress, Vec<FutexEntry>>;
static FUTEXES: Mutex<L1, FutexList> = ...;
pub struct FutexEntry {
    target_virtaddr: VirtualAddress,
    context_lock: Arc<ContextLock>,
    addr_space: Weak<AddrSpaceWrapper>, // For CoW safety
}
```
**Supported operations:**
| Op | Status | Notes |
|----|--------|-------|
| `FUTEX_WAIT` (32-bit) | ✅ | Validates alignment (4-byte), checks value, blocks |
| `FUTEX_WAIT64` (64-bit) | ✅ | x86_64 only, checks alignment (8-byte) |
| `FUTEX_WAKE` | ✅ | Wakes up to `val` waiters, `O(n)` scan by virtual address matching |
**NOT supported (critical gaps):**
| Op | Impact |
|----|--------|
| `FUTEX_REQUEUE` | Cannot move waiters between futexes — needed by condvar broadcast |
| `FUTEX_CMP_REQUEUE` | Cannot atomically compare-and-requeue — race condition risk |
| `FUTEX_WAKE_OP` | Cannot do atomic op + wake — needed by glibc mutex fast path |
| `FUTEX_LOCK_PI` | No priority inheritance — PTHREAD_PRIO_INHERIT is a stub |
| `FUTEX_TRYLOCK_PI` | No trylock with PI |
| `FUTEX_UNLOCK_PI` | No unlock with PI |
| `FUTEX_CMP_REQUEUE_PI` | No requeue with PI |
| `FUTEX_WAIT_BITSET` | No bitset wait — needed for `pselect`/`ppoll` optimization |
| `FUTEX_WAKE_BITSET` | No bitset wake |
| `FUTEX_WAIT_MULTIPLE` | As noted in code TODO, not implemented |
| `FUTEX_PRIVATE` flag | Conceptual TODO in code comment — "implement fully in userspace" |
**Performance concerns:**
1. **Global `FUTEXES` mutex** — all futex operations on all CPUs contend on a single L1 lock
2. **O(n) wake scan** — `FUTEX_WAKE` iterates all entries for a physical address to match by virtual address
3. **Full `HashMap` entry removal** — on wake, entry is `swap_remove`'d; on last waiter, the entire `HashMap` entry is removed (churn)
4. **No per-process futex isolation** — all futexes share the same global table, even process-private ones
5. **No wait-multiple** — waking multiple independent futexes requires multiple syscalls
### 2.4 relibc pthread Completeness
**Files:** `src/pthread/mod.rs`, `src/header/pthread/*.rs`, `src/header/sched/mod.rs`
| API Surface | Status | Notes |
|-------------|--------|-------|
| `pthread_create` / `pthread_join` / `pthread_detach` | ✅ Full | Stack via mmap, TLS init, waitval for join |
| `pthread_mutex_*` (normal, recursive, errorcheck) | ✅ Full | Internal implementation in `src/sync/` |
| `pthread_cond_*` | ✅ Full | Condition variables present |
| `pthread_rwlock_*` | ✅ Full | Read-write locks present |
| `pthread_barrier_*` | ✅ Full | Barriers present |
| `pthread_spin_*` | ✅ Full | Spinlocks present |
| `pthread_key_*` / TLS | ✅ Full | Thread-local storage with destructors |
| `pthread_once` | ✅ Full | call_once pattern |
| `pthread_cancel` / `pthread_setcancelstate` / `pthread_setcanceltype` | ✅ Full | Deferred + async cancellation via RT signal |
| `pthread_attr_*` (init/destroy/get/set) | ✅ Full | All attribute accessors implemented |
| `pthread_getattr_np` | ✅ Partial | Stack base/size returned; other attrs default |
| `pthread_setname_np` / `pthread_getname_np` | ✅ Delivered | Kernel proc: Name handle + relibc wrapper |
| `pthread_attr_setschedpolicy` | 🚧 Accepts value, kernel ignores | Kernel pays no attention to policy |
| `pthread_attr_setschedparam` | 🚧 Accepts value, kernel ignores | `sched_priority` stored but unused |
| `pthread_setschedparam` | 🚧 No-op | `set_sched_param()` — TODO comment |
| `pthread_setschedprio` | 🚧 No-op | `set_sched_priority()` — TODO comment |
| `pthread_mutexattr_setprotocol` | 🚧 Stub | PTHREAD_PRIO_INHERIT accepted but no-op |
| `pthread_mutexattr_setrobust` | 🚧 Stub | PTHREAD_MUTEX_ROBUST accepted but no-op |
| `pthread_mutexattr_setpshared` | 🚧 Partial | PROCESS_SHARED constant exists; futex supports cross-AS |
| `pthread_getcpuclockid` | 🚧 ENOENT | `get_cpu_clkid()` returns ENOENT |
| `pthread_kill` | ⚠️ Failing | child/invalid/self tests fail — race condition noted at `signal/mod.rs:178` |
| `pthread_atfork` | ❌ Empty stubs | Registered handlers exist but are no-ops — fork is NOT thread-safe |
| `pthread_sigmask` | ✅ | Via `sigprocmask` |
| **sched.h functions:** | | |
| `sched_yield` | ✅ | Via `Sys::sched_yield()` |
| `sched_get_priority_max` | 🚧 `todo!()` | |
| `sched_get_priority_min` | 🚧 `todo!()` | |
| `sched_getparam` | 🚧 `todo!()` | |
| `sched_setparam` | 🚧 `todo!()` | |
| `sched_setscheduler` | 🚧 `todo!()` | |
| `sched_rr_get_interval` | 🚧 `todo!()` | |
### 2.5 IPC Primitives Relevant to Multithreading
From `KERNEL-IPC-CREDENTIAL-PLAN.md` and direct code review:
| Primitive | Kernel Support | Threading Impact |
|-----------|---------------|-----------------|
| Futex | WAIT/WAKE only | **Critical** — base primitive for all userspace sync |
| Shared memory (shm/mmap MAP_SHARED) | ✅ Via memory scheme | Required for PTHREAD_PROCESS_SHARED |
| Signals (per-thread) | ✅ Via proc: scheme | Thread cancellation, SIGEV_THREAD |
| Pipe (kernel `pipe:` scheme) | ✅ | Thread communication |
| eventfd/signalfd/timerfd | ✅ Recipe-applied | Async I/O notification |
| SysV sem/shm | ✅ Recipe-activated (2026-04-29) | Qt QSystemSemaphore |
| POSIX msg queues | ❌ Missing | Low priority for desktop |
| SysV msg queues | ❌ Missing | Low priority for desktop |
---
## 3. Critical Gaps and Blockers
### 3.1 Priority Gaps (Blocking Desktop Responsiveness)
| # | Gap | Impact | Blocked Consumer |
|---|-----|--------|-----------------|
| G1 | **No SCHED_RR/SCHED_FIFO** | All threads treated equally; input/audio threads can't get priority | KWin input thread, PulseAudio |
| G2 | **No dynamic priority** | CPU-bound threads aren't penalized; I/O-bound threads aren't boosted | Desktop compositor under load |
| G3 | **No PI futexes** | Priority inversion: low-priority thread holding mutex blocks high-priority waiter | KWin compositor lock, Qt mutexes |
| G4 | **No `pthread_setschedparam`** | Applications can't request scheduling policy changes | All desktop apps |
| G5 | **No timeslice differentiation** | High-priority threads get same quantum as low-priority | Poor latency for foreground tasks |
### 3.2 Scalability Gaps (Blocking Many-Core Performance)
| # | Gap | Impact |
|---|-----|--------|
| G6 | **No work stealing** | CPUs go idle while work exists on other CPUs |
| G7 | **No load balancing** | New threads stay on creator CPU; imbalance builds over time |
| G8 | **Global context switch lock** | Serialization bottleneck beyond ~8 cores |
| G9 | **Global futex mutex** | All cores contend on single L1 lock for futex ops |
| G10 | **O(n) idle wake scan** | Linear scan proportional to total sleeping threads |
| G11 | **No NUMA awareness** | Cross-node memory access penalty on multi-socket systems |
### 3.3 Correctness Gaps (Blocking Robust Applications)
| # | Gap | Impact |
|---|-----|--------|
| G12 | **No robust mutexes** | Thread death while holding mutex → permanent deadlock |
| G13 | **No FUTEX_REQUEUE** | Condvar broadcast wakes all waiters → thundering herd |
| G14 | **No thread groups (tgid)** | `kill(pid, sig)` can't target a whole process; `getpid()` returns a distinct value per thread context |
| G15 | **Static-only sched_affinity** | No userspace CPU pinning API |
| G16 | **No setpriority/getpriority** | POSIX nice values not wired to kernel priority |
| G17 | **pthread barriers hang on SMP** | `check.sh` runs `-smp 1` to work around barrier/once hang on multi-core QEMU — **blocks KWin GPU barrier sync** |
| G18 | **pthread_kill race condition** | All four pthread_kill tests (child/invalid/self/kill0) are failing — thread-targeted signal delivery unreliable |
| G19 | **fork() thread-unsafe** | `pthread_atfork` handlers are empty no-ops; child inherits locked mutexes from parent |
| G20 | **Linux aarch64 rlct_clone stub** | `todo!("rlct_clone not implemented for aarch64 yet")` — **blocks aarch64 builds** |
---
## 4. Implementation Plan
### Phase S1: Scheduler Observability and Metrics (Week 1-2)
**Goal:** Add instrumentation to measure and understand scheduling behavior before optimizing.
#### S1.1 — Per-context scheduling statistics
Add to `Context` struct:
```rust
pub struct Context {
    // NEW scheduling statistics:
    pub sched_run_count: u64,        // Times this context was scheduled
    pub sched_wait_time: u128,       // Total time spent waiting (accumulated)
    pub sched_last_wake: u128,       // Timestamp of last unblock
    pub sched_migrations: u32,       // Times migrated between CPUs
    pub sched_preemptions: u32,      // Times preempted
    pub sched_voluntary_switch: u32, // Times yielded/blocked voluntarily
}
```
**Files:** `context/context.rs` — add fields, initialize in `Context::new()`, update in `switch()`
#### S1.2 — Per-CPU scheduler metrics
Add to `cpu_stats.rs`:
```rust
pub struct CpuStats {
    // Existing: user, nice, kernel, idle, irq
    // NEW:
    pub sched_scans: AtomicU64,           // number of select_next_context calls
    pub sched_empty_scans: AtomicU64,     // scans that found no runnable context
    pub sched_steals: AtomicU64,          // work stolen from other CPUs (future)
    pub sched_ipi_wakeups: AtomicU64,     // wakeups via IPI
    pub sched_max_queue_depth: AtomicU64, // maximum queue depth observed
}
```
#### S1.3 — `/scheme/sys/sched` debug interface
Expose scheduler metrics via a new kernel scheme path:
```
scheme:sys/sched/runqueues    — per-CPU run queue depths
scheme:sys/sched/top          — top-N contexts by recent CPU time
scheme:sys/sched/context/{id} — per-context scheduling stats
```
This enables `redbear-info` or a new `redbear-sched` tool for runtime diagnostics.
#### S1.4 — relibc `sched_getscheduler()` baseline
Wire `sched_getscheduler()` to return `SCHED_OTHER` (the current DWRR is closest to SCHED_OTHER):
```rust
// relibc/src/header/sched/mod.rs
pub extern "C" fn sched_getscheduler(pid: pid_t) -> c_int {
    // For now: all processes use SCHED_OTHER (DWRR)
    SCHED_OTHER
}
```
**Patch:** `local/patches/relibc/P5-sched-observe.patch`
---
### Phase S2: Real-Time Scheduling Support (Week 2-4)
**Goal:** Add `SCHED_FIFO` and `SCHED_RR` scheduling classes to the kernel, and wire relibc `sched_setscheduler()`.
#### S2.1 — Scheduling policy in Context
Add to `Context`:
```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SchedPolicy {
    Other,      // DWRR (current default)
    Fifo,       // Strict priority, no preemption within same priority
    RoundRobin, // Strict priority, round-robin within same priority
    // Future:
    // Batch,   // Throughput-optimized, lower priority than Other
    // Idle,    // Only runs when absolutely nothing else is runnable
}

pub struct Context {
    pub sched_policy: SchedPolicy, // NEW
    pub sched_rt_priority: u8,     // NEW: 0-99 RT priority
    // Renamed: prio → sched_dynamic_prio (for SCHED_OTHER)
    pub sched_dynamic_prio: usize,
    pub sched_static_prio: usize,  // NEW: base priority, unmodified by heuristics
}
```
**Initialization:** Default `sched_policy = SchedPolicy::Other`, `sched_rt_priority = 0`.
#### S2.2 — Priority mapping
```
SCHED_FIFO / SCHED_RR:
  RT priority 99 → kernel prio 0  (highest)
  RT priority 98 → kernel prio 1
  ...
  RT priority 0  → kernel prio 39 (lowest RT, still above SCHED_OTHER)

SCHED_OTHER:
  nice -20 → kernel prio 0  (still below RT 0)
  nice   0 → kernel prio 20 (default)
  nice +19 → kernel prio 39

SCHED_FIFO within same RT priority: no preemption (runs until blocks)
SCHED_RR   within same RT priority: round-robin with configurable quantum
```
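One consistent realization of this mapping table (the exact rounding for RT priorities is an implementation choice, not fixed by the plan; function names are hypothetical):

```rust
// RT 99 → kernel 0, RT 0 → kernel 39, monotonically decreasing in between.
fn rt_to_kernel_prio(rt_priority: u8) -> usize {
    39 - (rt_priority.min(99) as usize * 39) / 99
}

// nice -20 → kernel 0, nice 0 → 20, nice +19 → 39 (SCHED_OTHER band),
// clamping out-of-range nice values before the usize conversion.
fn nice_to_kernel_prio(nice: i32) -> usize {
    (nice + 20).clamp(0, 39) as usize
}
```

Note that the two functions map into separate per-policy prio spaces: the multi-pass dispatch in S2.3 guarantees any RT context is considered before any SCHED_OTHER context, regardless of the numeric kernel prio.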
#### S2.3 — Scheduler dispatch by policy
Modify `select_next_context()` to prioritize:
1. `SCHED_FIFO` contexts (highest RT priority first, no preemption per priority)
2. `SCHED_RR` contexts (highest RT priority first, round-robin per priority)
3. `SCHED_OTHER` contexts (existing DWRR)
```rust
fn select_next_context(...) -> ... {
    // PASS 1: SCHED_FIFO — first runnable at highest priority wins
    for prio in 0..40 {
        if let Some(fifo_ctx) = take_first_runnable_of_policy(
            prio, SchedPolicy::Fifo, &mut contexts_list,
        ) {
            return Ok(Some(fifo_ctx));
        }
    }
    // PASS 2: SCHED_RR — round-robin within priority
    for prio in 0..40 {
        if let Some(rr_ctx) = take_next_rr_of_policy(
            prio, &mut contexts_list, &mut percpu.rr_position[prio],
        ) {
            return Ok(Some(rr_ctx));
        }
    }
    // PASS 3: SCHED_OTHER — existing DWRR (unchanged)
    existing_dwrr_logic(...)
}
```
#### S2.4 — SCHED_RR timeslice configuration
Add per-context timeslice for SCHED_RR:
```rust
pub struct Context {
    pub sched_rr_quantum: u128, // nanoseconds, default 100ms
}
```
Override the 3-tick quantum for SCHED_RR contexts: track ticks consumed, preempt at quantum.
#### S2.5 — syscall interface for policy changes
Add kernel syscall or extend `proc:` scheme:
```
proc: scheme command: SetSchedPolicy(pid, policy, rt_priority)
```
#### S2.6 — Wire relibc `sched_setscheduler()`
```rust
// relibc/src/header/sched/mod.rs
pub extern "C" fn sched_setscheduler(
    pid: pid_t, policy: c_int, param: *const sched_param,
) -> c_int {
    let prio = unsafe { (*param).sched_priority };
    let kernel_policy = match policy {
        SCHED_FIFO => SchedPolicyRequest::Fifo,
        SCHED_RR => SchedPolicyRequest::RoundRobin,
        SCHED_OTHER => SchedPolicyRequest::Other,
        _ => return set_errno(EINVAL),
    };
    // Send to kernel via proc: scheme
    Sys::set_sched_policy(pid, kernel_policy, prio)
}
```
**Patches:**
- `local/patches/kernel/P5-sched-policy.patch` — Context fields + sched dispatch
- `local/patches/kernel/P5-sched-policy-proc.patch` — proc: scheme SetSchedPolicy
- `local/patches/relibc/P5-sched-setscheduler.patch` — wire through scheme
- `local/patches/relibc/P5-sched-getscheduler.patch` — return current policy
- `local/patches/relibc/P5-sched-priority.patch` — sched_get/setparam
---
### Phase S3: Load Balancing and Work Stealing (Week 4-6)
**Status: ✅ COMPLETE (2026-04-30)** — P3.1 PerCpuSched struct + P3.2 per-CPU wiring + P3.3 work stealing + P3.4 initial placement (least-loaded CPU) + P3.5 periodic load balancing all implemented.
**Goal:** Distribute runnable contexts across CPUs to maximize utilization.
#### S3.1 — Per-CPU run queue lock elimination
Replace the global `RUN_CONTEXTS: Mutex<L1, RunContextData>` with per-CPU run queues:
```rust
// In PercpuBlock:
pub struct PerCpuSched {
    pub run_queues: [VecDeque<WeakContextRef>; 40],
    pub run_queues_lock: SpinLock, // per-CPU, low contention
    pub balance: [usize; 40],
    pub last_queue: usize,
    pub idle_context: Arc<ContextLock>,
}
```
This eliminates the global L1 mutex bottleneck for dequeue operations.
#### S3.2 — Idle CPU work stealing
When `select_next_context()` finds no runnable context on the local CPU:
1. Pick a victim CPU (round-robin or random)
2. Lock victim's run queues
3. Dequeue the highest-priority runnable context
4. Return it for scheduling
```rust
fn steal_work(percpu: &PercpuBlock, cpu_id: LogicalCpuId) -> Option<ArcContextLockWriteGuard> {
    for victim_offset in 1..cpu_count() {
        let victim_id = (cpu_id + victim_offset) % cpu_count();
        let victim_percpu = percpu_for(victim_id);
        // Try to steal from highest priority queues first
        for prio in 0..40 {
            if let Some(ctx) = victim_percpu.dequeue_runnable(prio) {
                percpu.stats.sched_steals.fetch_add(1, Ordering::Relaxed);
                return Some(ctx);
            }
        }
    }
    None
}
```
#### S3.3 — Initial placement (fork/exec balance)
When creating a new context, instead of always going to the creating CPU's idle queue:
```rust
fn place_new_context(ctx: &mut Context) -> LogicalCpuId {
    // Pick the CPU with the shortest total run queue
    let target = cpus()
        .min_by_key(|cpu| cpu.total_runnable_contexts())
        .unwrap_or(crate::cpu_id());
    ctx.sched_affinity = LogicalCpuSet::single(target);
    target
}
```
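The least-loaded choice can be exercised in isolation with a toy, userspace version (function name hypothetical; slices of queue depths stand in for the per-CPU structures):

```rust
// Pick the index of the CPU with the smallest run-queue depth,
// mirroring place_new_context's min_by_key step above.
fn least_loaded_cpu(queue_depths: &[usize]) -> usize {
    queue_depths
        .iter()
        .enumerate()
        .min_by_key(|&(_, depth)| *depth)
        .map(|(cpu, _)| cpu)
        .unwrap_or(0)
}
```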
#### S3.4 — Periodic load balancing
Add a periodic balancing trigger (e.g., every 100ms or when queue depth difference exceeds threshold):
```rust
fn balance_load() {
    let avg_depth = average_runnable_per_cpu();
    // Overloaded = more than 25% above the average queue depth
    for cpu in overloaded_cpus(avg_depth + avg_depth / 4) {
        let target = most_idle_cpu();
        migrate_contexts(cpu, target, cpu.total_runnable() - avg_depth);
    }
}
```
**Patches:**
- `local/patches/kernel/P6-percpu-runqueues.patch` — per-CPU run queues (infrastructure)
---
### Phase S4: Futex Enhancements (Week 6-9)
**Status: ✅ COMPLETE (2026-04-30)** — S4.1 futex sharding (64-shard), S4.2 FUTEX_REQUEUE, S4.3 PI futex, S4.4 robust futex, vruntime tracking, minimum-vruntime selection all implemented.
**Goal:** Add PI, requeue, and per-futex locking to support robust desktop mutex performance.
#### S4.1 — Per-futex locking (reduce global contention)
Replace the single `FUTEXES: Mutex<L1, FutexList>` with a sharded hash table:
```rust
const FUTEX_SHARDS: usize = 64; // or scale with CPU count
static FUTEXES: [Mutex<FutexList>; FUTEX_SHARDS] = ...;

fn futex_shard(phys: PhysicalAddress) -> usize {
    phys.data() as usize % FUTEX_SHARDS
}
```
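One caveat worth noting about the sketch above: futex words are at least 4-byte aligned, so a plain `phys % 64` only ever lands on shard indices that are multiples of 4, leaving three quarters of the shards unused. Shifting out the alignment bits first avoids that (a sketch of the alternative, not necessarily what the delivered patch does):

```rust
const FUTEX_SHARDS: usize = 64;

// The low two bits of a 4-byte-aligned address are always zero; shift
// them out before the modulo so consecutive futex words map to
// distinct shards instead of every fourth shard.
fn futex_shard(phys_addr: usize) -> usize {
    (phys_addr >> 2) % FUTEX_SHARDS
}
```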
#### S4.2 — FUTEX_REQUEUE and FUTEX_CMP_REQUEUE
```rust
fn futex_requeue(
    addr1: PhysicalAddress, // source futex
    addr2: PhysicalAddress, // target futex
    val: usize,             // max to requeue
    val2: usize,            // expected value (for CMP_REQUEUE)
    cmp: bool,              // whether to compare first
) -> Result<usize> {
    // Atomically move up to `val` waiters from addr1's wait queue to addr2's.
    // If cmp is true, only proceed if *addr1 == val2.
}
```
This is critical for condition variable performance — without it, `pthread_cond_broadcast` causes a thundering herd where every waiter wakes, rechecks, and most re-block.
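The wake-one/requeue-rest pattern behind `pthread_cond_broadcast` can be modelled in miniature (toy queues of thread IDs, not the kernel types):

```rust
use std::collections::VecDeque;

// Wake up to `wake_n` waiters from the condvar queue, then move up to
// `requeue_n` of the remainder onto the mutex queue instead of waking
// them — they will be woken one at a time as the mutex is unlocked.
fn requeue_waiters(
    cond_q: &mut VecDeque<u32>,
    mutex_q: &mut VecDeque<u32>,
    wake_n: usize,
    requeue_n: usize,
) -> Vec<u32> {
    let woken: Vec<u32> = (0..wake_n).filter_map(|_| cond_q.pop_front()).collect();
    for _ in 0..requeue_n {
        match cond_q.pop_front() {
            Some(tid) => mutex_q.push_back(tid),
            None => break,
        }
    }
    woken
}
```

A broadcast is then `requeue_waiters(&mut cond, &mut mutex, 1, usize::MAX)`: exactly one thread wakes and contends, the rest queue behind the mutex.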
#### S4.3 — PI Futexes (FUTEX_LOCK_PI / FUTEX_UNLOCK_PI / FUTEX_TRYLOCK_PI / FUTEX_CMP_REQUEUE_PI)
Priority inheritance for futexes:
```rust
pub struct PiState {
    owner: Option<Arc<ContextLock>>,
    waiters: Vec<(Arc<ContextLock>, u32)>, // (context, original_priority)
}

// When a high-priority context blocks on a PI futex held by a low-priority context:
fn pi_boost(owner: &mut Context, waiter_prio: usize) {
    if waiter_prio < owner.sched_dynamic_prio {
        owner.sched_dynamic_prio = waiter_prio;
        owner.pi_boosted = true;
    }
}
```
**Critical path:** KWin compositor lock. Without PI, a low-priority background thread holding a mutex that the compositor thread needs can block rendering for an unbounded time.
#### S4.4 — Robust Futexes
Mark futex waiters in a `robust_list` so the kernel can unlock them on thread death:
```rust
pub struct RobustListEntry {
    futex_addr: usize,
    futex_len: usize,
    // List is per-thread, registered via set_robust_list syscall
}
```
On `exit_thread()`:
```rust
fn wake_robust_futexes(context: &Context) {
    for entry in &context.robust_list {
        // Set FUTEX_OWNER_DIED bit
        // Wake one waiter with EOWNERDEAD
    }
}
```
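A minimal model of the owner-died marking step, assuming the Linux futex word layout (TID in the low 30 bits, `FUTEX_WAITERS` and `FUTEX_OWNER_DIED` in the top two bits) as the encoding — the plan does not pin this down:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const FUTEX_WAITERS: u32 = 0x8000_0000;
const FUTEX_OWNER_DIED: u32 = 0x4000_0000;
const FUTEX_TID_MASK: u32 = 0x3fff_ffff;

// Per robust-list entry on thread exit: flag the word so the next
// locker sees EOWNERDEAD instead of deadlocking; return whether a
// waiter bit is set and a wakeup is therefore required.
fn mark_owner_died(word: &AtomicU32) -> bool {
    let old = word.fetch_or(FUTEX_OWNER_DIED, Ordering::AcqRel);
    old & FUTEX_WAITERS != 0
}
```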
**Patches:**
- `local/patches/kernel/P6-futex-sharding.patch` — futex lock sharding (delivered)
- (this patch list predates S4 completion — FUTEX_REQUEUE, PI, and robust futex have since landed per the status line above)
---
### Phase S5: Dynamic Priority and Thread Management (Week 9-11)
**Status: ✅ COMPLETE (2026-04-30)** — S5.1 vruntime + S5.2 setpriority/getpriority + S5.3 pthread_setaffinity_np + S5.4 pthread_setname_np + pthread_setschedparam (Redox) all implemented.
**Goal:** Add I/O-vs-CPU heuristics, CPU affinity API, and thread naming.
#### S5.1 — Dynamic priority adjustment (SCHED_OTHER)
Implement a simplified CFS-style virtual runtime tracking:
```rust
pub struct Context {
    pub vruntime: u128, // Virtual runtime (weighted by priority)
}

// On context switch OUT:
prev_context.vruntime += actual_runtime * SCHED_PRIO_TO_WEIGHT[default_prio]
    / SCHED_PRIO_TO_WEIGHT[prev_context.sched_static_prio];

// On select_next_context for SCHED_OTHER:
// Pick context with lowest vruntime instead of DWRR deficit tracking
```
This automatically penalizes CPU-bound threads (their vruntime grows faster) and favors I/O-bound threads (they sleep, vruntime stays low).
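The weighting can be exercised with values from the table in §2.1 (1024 is the weight at default prio 20, 9548 at prio 10):

```rust
// Weighted vruntime charge for one timeslice: heavier (higher-priority)
// contexts accrue vruntime more slowly, so the min-vruntime pick
// schedules them more often for the same wall-clock runtime.
fn vruntime_delta(actual_runtime_ns: u128, static_prio_weight: u128) -> u128 {
    const NICE_0_WEIGHT: u128 = 1024; // SCHED_PRIO_TO_WEIGHT[20]
    actual_runtime_ns * NICE_0_WEIGHT / static_prio_weight
}
```

A prio-10 context (weight 9548) is charged roughly a tenth of the vruntime a default context accrues for the same 1 ms of CPU.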
#### S5.2 — POSIX nice values
Map `nice(-20..+19)` to static priorities:
```rust
fn nice_to_static_prio(nice: i8) -> usize {
    // nice -20 → kernel prio 0  (SCHED_OTHER range)
    // nice   0 → kernel prio 20
    // nice +19 → kernel prio 39
    // Clamp before the usize cast so out-of-range negative values
    // map to 0, not to the high end of the range.
    (nice as i32 + 20).clamp(0, 39) as usize
}
// Wire setpriority/getpriority to modify sched_static_prio
```
#### S5.3 — CPU affinity API
Add to `proc:` scheme:
```
proc: scheme command: SetAffinity(pid, affinity_mask: u64)
proc: scheme command: GetAffinity(pid) → u64
```
Wire in relibc:
```rust
pub extern "C" fn pthread_setaffinity_np(
    thread: pthread_t, cpusetsize: size_t, cpuset: *const cpu_set_t,
) -> c_int {
    let mask = unsafe { read_cpu_set(cpuset, cpusetsize) };
    Sys::set_cpu_affinity(thread.os_tid, mask)
}
```
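`read_cpu_set` above is an assumed helper; its core job can be sketched as packing the `cpu_set_t` bytes into the `u64` mask that the proposed `SetAffinity` proc: command takes:

```rust
// Collapse the first 8 bytes of a cpu_set_t-style bitmap into a u64
// mask (CPU 0 = bit 0). CPUs beyond 63 are ignored in this sketch.
fn cpuset_to_mask(bytes: &[u8]) -> u64 {
    bytes
        .iter()
        .take(8)
        .enumerate()
        .fold(0u64, |mask, (i, b)| mask | (*b as u64) << (i * 8))
}
```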
#### S5.4 — Thread naming API
The kernel `Context.name` field already exists (32-char `ArrayString`). Wire it:
```rust
// proc: scheme command: SetName(pid, name)
// relibc:
pub extern "C" fn pthread_setname_np(thread: pthread_t, name: *const c_char) -> c_int {
    let name = unsafe { CStr::from_ptr(name) };
    Sys::set_thread_name(thread.os_tid, name)
}
```
**Patches:**
- `local/patches/kernel/P6-vruntime-context.patch` — vruntime field + initialization
- `local/patches/kernel/P6-vruntime-switch.patch` — weighted update + min-vruntime selection
- `local/patches/kernel/P7-cache-affine-context.patch` — cache-affine scheduling (last_cpu)
- `local/patches/kernel/P7-cache-affine-switch.patch` — cache-affine vruntime bonus
- `local/patches/kernel/P7-proc-setpriority.patch` — setpriority proc handle
- `local/patches/kernel/P7-proc-setname.patch` — thread naming proc handle
- `local/patches/relibc/P7-setpriority.patch` — setpriority/getpriority
- `local/patches/relibc/P7-pthread-affinity.patch` — pthread_setaffinity_np
- `local/patches/relibc/P7-pthread-setname.patch` — pthread_setname_np
---
### Phase S6: NUMA and Cache-Affine Scheduling (Week 11-13)
**Status: 🚧 PARTIAL (2026-04-30)** — S6.3 cache-affine scheduling delivered; S6.1 NUMA topology kernel hints implemented. NUMA discovery (SRAT/SLIT parsing) is a userspace responsibility (numad daemon via /scheme/acpi/); the kernel stores a lightweight NumaTopology for O(1) scheduling lookups. The full userspace numad daemon is follow-up work.
**Goal:** Optimize for multi-socket systems by keeping related threads near their memory.
#### S6.1 — NUMA topology discovery
Parse ACPI SRAT/SLIT tables (already available in ACPI infrastructure):
```rust
pub struct NumaTopology {
    nodes: Vec<NumaNode>,
    distances: Vec<Vec<u8>>, // SLIT inter-node distances
}

pub struct NumaNode {
    id: u8,
    cpus: LogicalCpuSet,
    memory: PhysicalMemoryRange,
}
```
#### S6.2 — NUMA-aware initial placement
When creating a new context:
1. If parent thread has `sched_affinity`, prefer CPUs in the same NUMA node
2. Otherwise, pick the NUMA node with the most free memory
#### S6.3 — Cache-affine scheduling
Track the last CPU a context ran on. Prefer to re-schedule on the same CPU to avoid cache migration penalty:
```rust
pub struct Context {
    pub sched_last_cpu: LogicalCpuId, // already tracked via cpu_id before it becomes None
}
```
In `select_next_context()`:
```rust
// When scanning runnable contexts, prefer those whose last_cpu == current_cpu_id
// (hot cache) over those from other CPUs (cold cache)
let hot_ctx = search_for_hot_context(current_cpu, &queues);
let fallback = search_for_cold_context(&queues);
hot_ctx.or(fallback)
```
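The hot-over-cold preference can be modelled with plain data (tids and CPU ids stand in for the kernel's context references; names hypothetical):

```rust
// Prefer a runnable context whose last CPU matches the current CPU
// (warm cache); otherwise fall back to the head of the queue.
fn pick_context(current_cpu: u32, runnable: &[(u32 /* tid */, u32 /* last_cpu */)]) -> Option<u32> {
    runnable
        .iter()
        .find(|&&(_tid, last_cpu)| last_cpu == current_cpu)
        .or_else(|| runnable.first())
        .map(|&(tid, _)| tid)
}
```

The delivered patches express the same preference as a vruntime bonus rather than a hard two-pass search, which keeps fairness intact while still biasing toward warm caches.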
**Patches:**
- `local/patches/kernel/P7-cache-affine-context.patch` — cache-affine scheduling (delivered)
- `local/patches/kernel/P7-cache-affine-switch.patch` — cache-affine vruntime bonus (delivered)
- (NUMA SRAT/SLIT parsing deferred)
---
### Phase R1: relibc POSIX Scheduling API Completion (Week 2-4, parallel with S2)
**Goal:** Fill all `todo!()` stubs in `sched.h` and `pthread.h` scheduling functions.
| Function | Implementation |
|----------|---------------|
| `sched_get_priority_max(policy)` | Return 99 for FIFO/RR, 0 for OTHER |
| `sched_get_priority_min(policy)` | Return 1 for FIFO/RR, 0 for OTHER |
| `sched_getparam(pid, param)` | Query kernel for current RT priority |
| `sched_setparam(pid, param)` | Delegate to `sched_setscheduler` with current policy |
| `sched_getscheduler(pid)` | Query kernel for current policy |
| `sched_rr_get_interval(pid, tp)` | Return SCHED_RR quantum (default 100ms) |
| `pthread_setschedparam(thread, policy, param)` | Set kernel sched policy via proc: scheme |
| `pthread_getschedparam(thread, policy, param)` | Get kernel sched policy |
| `pthread_setschedprio(thread, prio)` | Set dynamic priority within current policy |
| `pthread_getcpuclockid(thread, clock_id)` | Return CPU-time clock for thread |
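A minimal sketch of the priority-range rows, assuming Linux-style policy numbering; the actual relibc constants and error paths may differ (a real implementation would return -1 with errno set to EINVAL for unknown policies):

```rust
// Illustrative policy constants; relibc's actual definitions may differ.
const SCHED_OTHER: i32 = 0;
const SCHED_FIFO: i32 = 1;
const SCHED_RR: i32 = 2;

fn sched_get_priority_max(policy: i32) -> i32 {
    match policy {
        SCHED_FIFO | SCHED_RR => 99, // RT static priority range is 1..=99
        SCHED_OTHER => 0,            // time-sharing: no static priority
        _ => -1,                     // EINVAL in the real API
    }
}

fn sched_get_priority_min(policy: i32) -> i32 {
    match policy {
        SCHED_FIFO | SCHED_RR => 1,
        SCHED_OTHER => 0,
        _ => -1,
    }
}
```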
**Patches:** All in `local/patches/relibc/P5-sched-complete.patch`
---
### Phase R2: Robust and PI Mutex Support (Week 5-9, parallel with S4)
**Goal:** Full POSIX mutex robustness and priority inheritance.
#### R2.1 — PI mutex protocol
```rust
// relibc/src/sync/pthread_mutex.rs
pub struct PthreadMutex {
futex: AtomicU32,
owner: AtomicUsize, // os_tid of current owner
pi_waiters: Mutex<Vec<(OsTid, u32)>>, // waiters with requested priority
flags: AtomicU32, // PTHREAD_PRIO_INHERIT, PTHREAD_MUTEX_ROBUST
}
// Lock with PI:
fn lock_pi(&self) -> Result<(), Errno> {
loop {
match futex::lock_pi(&self.futex) {
Ok(()) => {
self.owner.store(current_tid(), Ordering::Release);
return Ok(());
}
Err(EAGAIN) => continue,
Err(err) => return Err(err),
}
}
}
```
#### R2.2 — Robust mutex protocol
```rust
pub struct RobustList {
head: *mut RobustListHead,
}
pub struct RobustListHead {
list: RobustList,
futex_offset: isize,
pending: *mut RobustListHead,
}
// On thread exit:
fn handle_robust_list(thread: &Pthread) {
    for entry in thread.robust_list.iter() {
        // futex_offset is relative to the address of the list entry itself
        let futex_addr = (entry as *const RobustListHead as usize)
            .wrapping_add_signed(entry.futex_offset) as *mut AtomicU32;
        // Set FUTEX_OWNER_DIED so the next locker observes EOWNERDEAD
        unsafe { (*futex_addr).fetch_or(FUTEX_OWNER_DIED, Ordering::Release) };
        // Wake one waiter; its lock attempt returns EOWNERDEAD
        futex::wake(futex_addr, 1);
    }
}
```
---
### Phase R3: Thread Groups and Process Identity (Week 10-12)
**Goal:** Proper tgid/pid distinction, `kill(pid, 0)` process targeting.
#### R3.1 — Kernel thread group concept
```rust
pub struct Context {
pub tgid: usize, // Thread Group ID (= pid for main thread)
pub tid: usize, // Thread ID (unique per thread)
}
```
- On `clone(CLONE_THREAD)`: child gets same tgid as parent, new tid
- On fork: child gets new tgid = child's tid
- `getpid()` returns tgid
- `gettid()` returns tid
- `kill(tgid, sig)` delivers signal to all threads in thread group
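The identity rules above collapse to one small decision. The function signature below is illustrative only; in the kernel these would be fields set directly on the new `Context`:

```rust
/// Returns (tgid, tid) for a newly created context. `clone_thread`
/// mirrors CLONE_THREAD; `new_tid` is the freshly allocated thread ID.
fn assign_ids(clone_thread: bool, parent_tgid: usize, new_tid: usize) -> (usize, usize) {
    if clone_thread {
        (parent_tgid, new_tid) // same thread group as the parent
    } else {
        (new_tid, new_tid) // fork(): new group; the leader's tgid == its own tid
    }
}
```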
#### R3.2 — Thread group signal delivery
```rust
fn deliver_signal_to_thread_group(tgid: usize, sig: Signal) {
for context in contexts_in_thread_group(tgid) {
// Pick a thread that hasn't blocked this signal
if !context.sig_blocked(sig) {
context.deliver_signal(sig);
break;
}
}
}
```
**Patches:**
- `local/patches/kernel/P5-tgid.patch` — thread group ID kernel support
- `local/patches/kernel/P5-tgid-signal.patch` — process-targeted signal delivery
- `local/patches/relibc/P5-gettid.patch` — gettid() syscall
---
## 5. Dependency Chain
```
Phase S1 (observability)
├──► Phase S2 (real-time scheduling) ────┐
│ │ │
│ ├──► Phase R1 (POSIX sched API) │
│ │ │
│ └──► KWin input thread priority │
│ │
├──► Phase S3 (load balancing) ───────────┤
│ │ │
│ └──► Mesa worker thread scaling │
│ │
├──► Phase S4 (futex enhancements) ───────┤
│ │ │
│ ├──► Phase R2 (PI/robust mutex) │
│ │ │
│ └──► KWin compositor lock │
│ │
├──► Phase S5 (dynamic prio + affinity) ──┤
│ │ │
│ └──► Application CPU pinning │
│ │
├──► Phase R3 (thread groups) ────────────┤
│ │ │
│ └──► process-targeted signals │
│ │
└──► Phase S6 (NUMA) ─────────────────────┘
└──► Multi-socket server performance
```
**Independent work (can run in parallel):**
- S2 (RT scheduling) + R1 (POSIX sched API) — parallel
- S4 (futex) + R2 (PI/robust mutex) — parallel
- S3 (load balancing) can start after S1 but independently of S2
- S6 (NUMA) depends on S3 (per-CPU queues) but not on S4/S5
---
## 6. Integration with Existing Plans
| Existing Plan | Relationship |
|---------------|-------------|
| `KERNEL-IPC-CREDENTIAL-PLAN.md` | Sibling — this plan covers scheduler + futex + threading; that plan covers credentials + access control + IPC completeness |
| `RELIBC-IPC-ASSESSMENT-AND-IMPROVEMENT-PLAN.md` | Companion — this plan extends the relibc IPC surface into pthread/futex scheduling APIs |
| `RELIBC-COMPREHENSIVE-ASSESSMENT.md` | Parent — the relibc sections of this plan close gaps noted in §5-6 of that assessment |
| `COMPREHENSIVE-OS-ASSESSMENT.md` | Parent — this plan closes §2 kernel gaps for scheduler/scalability |
| `CONSOLE-TO-KDE-DESKTOP-PLAN.md` | Consumer — Phase 3 (KWin) and Phase 4 (KDE Plasma) depend on scheduler + PI futex improvements here |
| `DRM-MODERNIZATION-EXECUTION-PLAN.md` | Sibling — GPU worker thread scheduling benefits from load balancing (S3) |
| `IRQ-AND-LOWLEVEL-CONTROLLERS-ENHANCEMENT-PLAN.md` | Sibling — IRQ latency affects scheduling latency |
---
## 7. Patch Governance
All kernel and relibc source changes follow the durability policy (`local/AGENTS.md`):
```
local/patches/
├── kernel/
│ (Delivered: P6-* and P7-* patches below. P5-sched-* entries are planned future carriers.)
│ ├── P5-sched-observability.patch # S1
│ ├── P5-sched-policy.patch # S2
│ ├── P5-sched-policy-proc.patch # S2 proc: scheme
│ ├── P6-percpu-runqueues.patch # S3 (delivered: infrastructure)
│ ├── P6-futex-sharding.patch # S4 (delivered: sharding)
│ ├── P6-vruntime-context.patch # S5 (delivered: field + init)
│ ├── P6-vruntime-switch.patch # S5 (delivered: update + selection)
│   └── (remaining S3-S6 patches deferred)
└── relibc/
    ├── P5-sched-observe.patch        # R1 baseline
    ├── P5-sched-setscheduler.patch   # R1
    ├── P5-sched-getscheduler.patch   # R1
    ├── P5-sched-priority.patch       # R1
    ├── P5-sched-complete.patch       # R1 remaining stubs
    ├── (PI/robust mutex deferred)    # R2
    ├── P7-setpriority.patch          # S5 (delivered)
    ├── P7-pthread-affinity.patch     # S5 (delivered)
    └── P5-gettid.patch               # R3
```
---
## 8. Validation and Evidence
### 8.1 Build Evidence
| Check | Command |
|-------|---------|
| Kernel compiles | `make r.kernel` |
| relibc compiles | `make r.relibc` |
| Full OS builds | `make all CONFIG_NAME=redbear-full` |
### 8.2 Runtime Evidence
| Test | Verification |
|------|-------------|
| `sched_getscheduler()` returns policy | `redbear-info --sched` |
| `pthread_setschedparam()` changes priority | Threaded test binary: `test-sched-priority` |
| RT thread preempts SCHED_OTHER | Latency test: RT thread wakes within 100μs |
| Work stealing across CPUs | `redbear-info --sched` shows balanced queue depths |
| PI futex prevents priority inversion | PI test: low-prio holder, high-prio waiter, medium-prio contester |
| Robust mutex recovery after thread kill | Robust test: kill thread holding mutex, verify EOWNERDEAD |
| Thread affinity pinning | `taskset`-like test: verify thread stays on assigned CPU |
| Load balancing on fork bomb | Spawn 2× CPUs threads, verify even distribution |
### 8.3 Verification Scripts
```bash
local/scripts/test-sched-qemu.sh # Scheduler metric validation
local/scripts/test-sched-rt-qemu.sh # Real-time scheduling proof
local/scripts/test-futex-pi-qemu.sh # PI futex proof
local/scripts/test-futex-robust-qemu.sh # Robust futex proof
local/scripts/test-sched-balance-qemu.sh # Load balancing proof (multi-vCPU)
```
---
## 9. Bottom Line
The Redox kernel scheduler is **functional but simple** — a correct DWRR implementation that works for a lightly loaded system. For the KDE/Wayland desktop with dozens of competing threads (compositor, rendering, I/O, timers, D-Bus, input), it needs:
1. **Real-time scheduling** (S2) — for audio and compositor input threads
2. **PI futexes** (S4/R2) — to prevent the compositor lock from being inverted by background work
3. **Load balancing** (S3) — to use all available cores efficiently
4. **Dynamic priority** (S5) — to keep the compositor responsive under CPU load
These four items are the **critical path** to a responsive desktop. The remaining items (NUMA, thread groups, robust mutexes, affinity API) are important for correctness and server-class workloads but not desktop-blocking.
**Total estimated effort:** 13 weeks with 1-2 kernel developers, delivering incremental improvements at each phase boundary.