# Red Bear OS — Kernel Scheduler, Multithreading, and IPC Performance Improvement Plan
**Date:** 2026-04-30

**Scope:** Kernel scheduler optimization, futex enhancements, multithreaded performance, relibc POSIX threading completeness

**Status:** S3 complete (per-CPU + stealing + balancing + placement), S4 complete (futex sharding + REQUEUE + PI + robust + vruntime), S5 complete (setpriority + affinity + naming + schedparam), S6 delivered (cache-affine + NUMA topology kernel hints; full userspace numad daemon is follow-up work). This is the **canonical scheduler + multithreading authority**, extending `KERNEL-IPC-CREDENTIAL-PLAN.md` and `RELIBC-IPC-ASSESSMENT-AND-IMPROVEMENT-PLAN.md`.

---
## 1. Executive Summary
The Redox microkernel uses a **Deficit Weighted Round Robin (DWRR)** scheduler with 40 static priority levels, per-CPU run queues, and cooperative preemption. The relibc C library provides a largely complete pthreads implementation; the POSIX scheduling APIs (`sched_*`, `pthread_setschedparam`) were stubbed out when this plan was drafted and have since been wired (S5/R1). For the KDE/Wayland desktop path, multithreaded performance bottlenecks in the scheduler and futex subsystem will become the dominant limitation once the compositor (KWin) and GPU rendering pipelines are active.

### Current State at a Glance

| Area | Status | Key Gaps |
|------|--------|----------|
| Kernel scheduler | DWRR, 40 levels, vruntime selection for SCHED_OTHER, RT pass for FIFO/RR | Per-CPU queues, stealing, and balancing delivered (S3); NUMA-aware placement awaits the numad daemon (S6 follow-up) |
| Futex | WAIT/WAIT64/WAKE + 64-shard hash table + REQUEUE/PI/robust (S4) | No WAKE_OP, bitset ops, or WAIT_MULTIPLE |
| relibc pthreads | Create/join/detach/mutex/cond/rwlock/barrier/spin/tls + `sched_*`/affinity/naming (S5/R1) | `pthread_getcpuclockid` still ENOENT; `pthread_atfork` handlers are no-ops |
| Thread management | proc: scheme clone/fork/exec | No thread groups (tgid/tid); R3 pending |
| IPC for threading | Futex, shared memory, signals | No POSIX/SysV message queues (low priority for desktop) |
### Why This Matters for the Desktop Path

```
KWin compositor (Qt6/QPA/Wayland)
  └── Worker threads: rendering, input, effects
        └── Requires: efficient futex wakeups, PI for compositor lock
        └── Requires: SCHED_RR for input thread priority

Mesa GPU driver (LLVMpipe or hardware)
  └── Gallium worker threads: shader compilation, draw submission
        └── Requires: load-balanced scheduling across all CPUs
        └── Requires: non-contended futex performance

Qt6 event loop
  └── Thread pool for QFuture/QtConcurrent
        └── Requires: SCHED_OTHER fair scheduling under load
        └── Requires: proper pthread_attr_setschedparam
```
---
## 2. Current Architecture Assessment

### 2.1 Scheduler Architecture

**File:** `recipes/core/kernel/source/src/context/switch.rs`

**Algorithm:** Deficit Weighted Round Robin (DWRR) — documented at line 354:

```rust
/// This is the scheduler function which currently utilises Deficit Weighted Round Robin Scheduler
fn select_next_context(...)
```

**Key data structures** (from `context/mod.rs`):

```rust
// 40 priority levels, each with its own queue
pub struct RunContextData {
    set: [VecDeque<WeakContextRef>; 40],
}

// Global lock for run queues (L1 = highest-level lock)
static RUN_CONTEXTS: Mutex<L1, RunContextData> = ...;

// Idle/sleeping contexts — scanned linearly on every tick
static IDLE_CONTEXTS: Mutex<L2, VecDeque<WeakContextRef>> = ...;

// All contexts (for enumeration)
static CONTEXTS: RwLock<L2, BTreeSet<ContextRef>> = ...;
```

**Priority weights** (geometric decay ~1.25x per level):

```rust
const SCHED_PRIO_TO_WEIGHT: [usize; 40] = [
    88761, 71755, 56483, 46273, 36291, 29154, 23254, 18705, 14949, 11916,
     9548,  7620,  6100,  4904,  3906,  3121,  2501,  1991,  1586,  1277,
     1024,   820,   655,   526,   423,   335,   272,   215,   172,   137,
      110,    87,    70,    56,    45,    36,    29,    23,    18,    15,
];
```
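What the ~1.25x decay means in practice: each priority step shifts relative CPU share by roughly 20%. A back-of-the-envelope check (illustrative arithmetic only, not kernel code):

```rust
fn main() {
    // Two competing SCHED_OTHER threads, weights from the table above.
    let w20 = 1024.0; // default priority 20
    let w21 = 820.0;  // one level lower
    // Under weighted-fair sharing, each thread's share is its weight
    // divided by the sum of all runnable weights.
    let share_prio20 = w20 / (w20 + w21);
    println!("prio 20 gets {:.1}% of the CPU", share_prio20 * 100.0); // ~55.5%
}
```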
**Time quantum:** 3 PIT ticks per context (~12.2ms). PIT channel 0 has divisor 4847 at 1.193182 MHz → ~4.062ms per tick. 3 ticks → ~12.2ms between context switches. The 6.75ms in the tick() comment is outdated.

**Default priority:** 20 (middle of range).

**Max scheduler iterations:** 5000 per `select_next_context` call (bail-out limit).

**Per-CPU state:** `percpu.balance: [usize; 40]` (deficit counters), `percpu.last_queue` (round-robin position).

**Preemption:** Preemptible unless `context.preempt_locks > 0` (guarded by `PreemptGuard` RAII wrappers).

**Context switch lock:** Global `arch::CONTEXT_SWITCH_LOCK` — spinlock with `compare_exchange_weak` on `Ordering::SeqCst`.

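The quantum arithmetic above checks out as follows (standalone verification, not kernel code):

```rust
fn main() {
    // PIT base frequency and the channel-0 divisor cited above.
    let pit_hz = 1_193_182.0_f64;
    let divisor = 4847.0_f64;
    let tick_ms = divisor / pit_hz * 1000.0; // ≈ 4.062 ms per tick
    let quantum_ms = 3.0 * tick_ms;          // ≈ 12.19 ms per 3-tick quantum
    println!("tick = {tick_ms:.3} ms, quantum = {quantum_ms:.2} ms");
}
```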
**Current limitations:**

1. **No real-time scheduling wired to userspace** — kernel has SchedPolicy enum and RT scheduling pass, but relibc `sched_setscheduler` returns ENOSYS for FIFO/RR until kernel wire-up is complete.
2. **No dynamic priority adjustment** — `context.prio` is set once and never changes. vruntime-based fairness compensates for SCHED_OTHER but no nice-value decay/boost.
3. **No work stealing** — each CPU only dequeues from its own queues. A CPU can go idle while another has backlog.
4. **No load balancing** — newly created contexts go to the creating CPU's idle queue. No migration across CPUs.
5. **O(n) idle wakeup scan** — `wakeup_contexts()` linearly scans the entire `IDLE_CONTEXTS` VecDeque on every tick (every ~2.25ms effective).
6. **Single global context switch lock** — `arch::CONTEXT_SWITCH_LOCK` serializes all CPU context switches on many-core systems.
7. **No NUMA awareness** — memory locality is not considered during scheduling.
8. **No timeslice scaling** — all contexts get the same 3-tick quantum regardless of priority (priority only affects how often they're picked, not how long they run).
9. **Large fixed iteration limit** — 5000 iterations per schedule attempt can cause latency spikes under heavy load.
### 2.2 Context/Thread Model

**File:** `recipes/core/kernel/source/src/context/context.rs`

```rust
pub struct Context {
    pub prio: usize,                   // Priority (0-39, default 20)
    pub status: Status,                // Runnable / Blocked / HardBlocked / Dead
    pub running: bool,                 // Currently on a CPU
    pub cpu_id: Option<LogicalCpuId>,  // Which CPU this context is on
    pub sched_affinity: LogicalCpuSet, // Allowed CPU set
    pub cpu_time: u128,                // Accumulated CPU time (nanoseconds)
    pub switch_time: u128,             // Last switch-in time
    pub wake: Option<u128>,            // Wake timestamp for timed sleeps
    pub preempt_locks: usize,          // Preemption disable counter
    pub kfx: AlignedBox,               // SIMD/FPU save area
    pub addr_space: Option<Arc<AddrSpaceWrapper>>, // Can be shared (threads)
    pub files: Arc<LockedFdTbl>,       // Can be shared (same process threads)
    pub owner_proc_id: Option<NonZeroUsize>, // Parent process
    pub name: ArrayString<32>,         // Human-readable name
    // Credentials:
    pub euid: u32, pub egid: u32, pub pid: usize,
    pub groups: Vec<u32>,              // Supplementary groups
}
```

**Thread creation flow:**

```
pthread_create()
  → relibc::pthread::create()
      → mmap() for stack
      → Tcb::new() for TLS
      → stack setup with entry shim
      → Sys::rlct_clone(stack, os_specific)
          → redox_rt::clone()
              → proc: scheme -> kernel clone
                  → Context::new() (same owner_proc_id, shared addr_space)
                  → context::spawn() (pushed to IDLE_CONTEXTS)
```

**Key architectural points:**

- Threads share the same `addr_space: Arc<AddrSpaceWrapper>` (same page tables)
- Threads share `files: Arc<LockedFdTbl>` (same FD table)
- Thread ownership via `owner_proc_id` — but no formal thread group concept
- No distinction between process and thread at kernel level — all are Contexts
- `pid` is set once, no `tgid`/`tid` distinction

### 2.3 Futex Implementation

**File:** `recipes/core/kernel/source/src/syscall/futex.rs`

```rust
// Global hash table: PhysicalAddress → Vec<FutexEntry>
type FutexList = HashMap<PhysicalAddress, Vec<FutexEntry>>;
static FUTEXES: Mutex<L1, FutexList> = ...;

pub struct FutexEntry {
    target_virtaddr: VirtualAddress,
    context_lock: Arc<ContextLock>,
    addr_space: Weak<AddrSpaceWrapper>, // For CoW safety
}
```
**Supported operations:**

| Op | Status | Notes |
|----|--------|-------|
| `FUTEX_WAIT` (32-bit) | ✅ | Validates alignment (4-byte), checks value, blocks |
| `FUTEX_WAIT64` (64-bit) | ✅ | x86_64 only, checks alignment (8-byte) |
| `FUTEX_WAKE` | ✅ | Wakes up to `val` waiters, `O(n)` scan by virtual address matching |

**NOT supported (critical gaps):**

| Op | Impact |
|----|--------|
| `FUTEX_REQUEUE` | Cannot move waiters between futexes — needed by condvar broadcast |
| `FUTEX_CMP_REQUEUE` | Cannot atomically compare-and-requeue — race condition risk |
| `FUTEX_WAKE_OP` | Cannot do atomic op + wake — needed by glibc mutex fast path |
| `FUTEX_LOCK_PI` | No priority inheritance — PTHREAD_PRIO_INHERIT is a stub |
| `FUTEX_TRYLOCK_PI` | No trylock with PI |
| `FUTEX_UNLOCK_PI` | No unlock with PI |
| `FUTEX_CMP_REQUEUE_PI` | No requeue with PI |
| `FUTEX_WAIT_BITSET` | No bitset wait — needed for `pselect`/`ppoll` optimization |
| `FUTEX_WAKE_BITSET` | No bitset wake |
| `FUTEX_WAIT_MULTIPLE` | As noted in code TODO, not implemented |
| `FUTEX_PRIVATE` flag | Conceptual TODO in code comment — "implement fully in userspace" |

**Performance concerns:**

1. **Global `FUTEXES` mutex** — all futex operations on all CPUs contend on a single L1 lock
2. **O(n) wake scan** — `FUTEX_WAKE` iterates all entries for a physical address to match by virtual address
3. **Full `HashMap` entry removal** — on wake, the entry is `swap_remove`'d; on last waiter, the entire `HashMap` entry is removed (churn)
4. **No per-process futex isolation** — all futexes share the same global table, even process-private ones
5. **No wait-multiple** — waiting on multiple independent futexes requires multiple syscalls
### 2.4 relibc pthread Completeness

**Files:** `src/pthread/mod.rs`, `src/header/pthread/*.rs`, `src/header/sched/mod.rs`

| API Surface | Status | Notes |
|-------------|--------|-------|
| `pthread_create` / `pthread_join` / `pthread_detach` | ✅ Full | Stack via mmap, TLS init, waitval for join |
| `pthread_mutex_*` (normal, recursive, errorcheck) | ✅ Full | Internal implementation in `src/sync/` |
| `pthread_cond_*` | ✅ Full | Condition variables present |
| `pthread_rwlock_*` | ✅ Full | Read-write locks present |
| `pthread_barrier_*` | ✅ Full | Barriers present |
| `pthread_spin_*` | ✅ Full | Spinlocks present |
| `pthread_key_*` / TLS | ✅ Full | Thread-local storage with destructors |
| `pthread_once` | ✅ Full | call_once pattern |
| `pthread_cancel` / `pthread_setcancelstate` / `pthread_setcanceltype` | ✅ Full | Deferred + async cancellation via RT signal |
| `pthread_attr_*` (init/destroy/get/set) | ✅ Full | All attribute accessors implemented |
| `pthread_getattr_np` | ✅ Partial | Stack base/size returned; other attrs default |
| `pthread_setname_np` / `pthread_getname_np` | ✅ Delivered | Kernel proc: Name handle + relibc wrapper |
| `pthread_attr_setschedpolicy` | 🚧 Accepts value, kernel ignores | Kernel pays no attention to policy |
| `pthread_attr_setschedparam` | 🚧 Accepts value, kernel ignores | `sched_priority` stored but unused |
| `pthread_setschedparam` | 🚧 No-op | `set_sched_param()` — TODO comment |
| `pthread_setschedprio` | 🚧 No-op | `set_sched_priority()` — TODO comment |
| `pthread_mutexattr_setprotocol` | 🚧 Stub | PTHREAD_PRIO_INHERIT accepted but no-op |
| `pthread_mutexattr_setrobust` | 🚧 Stub | PTHREAD_MUTEX_ROBUST accepted but no-op |
| `pthread_mutexattr_setpshared` | 🚧 Partial | PROCESS_SHARED constant exists; futex supports cross-AS |
| `pthread_getcpuclockid` | 🚧 ENOENT | `get_cpu_clkid()` returns ENOENT |
| `pthread_kill` | ⚠️ Failing | Failing tests (child/invalid/self) — race condition noted at `signal/mod.rs:178` |
| `pthread_atfork` | ❌ Empty stubs | Registered handlers exist but are no-ops — fork is NOT thread-safe |
| `pthread_sigmask` | ✅ | Via `sigprocmask` |
| **sched.h functions:** | | |
| `sched_yield` | ✅ | Via `Sys::sched_yield()` |
| `sched_get_priority_max` | 🚧 `todo!()` | |
| `sched_get_priority_min` | 🚧 `todo!()` | |
| `sched_getparam` | 🚧 `todo!()` | |
| `sched_setparam` | 🚧 `todo!()` | |
| `sched_setscheduler` | 🚧 `todo!()` | |
| `sched_rr_get_interval` | 🚧 `todo!()` | |
### 2.5 IPC Primitives Relevant to Multithreading

From `KERNEL-IPC-CREDENTIAL-PLAN.md` and direct code review:

| Primitive | Kernel Support | Threading Impact |
|-----------|---------------|-----------------|
| Futex | WAIT/WAKE only | **Critical** — base primitive for all userspace sync |
| Shared memory (shm/mmap MAP_SHARED) | ✅ Via memory scheme | Required for PTHREAD_PROCESS_SHARED |
| Signals (per-thread) | ✅ Via proc: scheme | Thread cancellation, SIGEV_THREAD |
| Pipe (kernel `pipe:` scheme) | ✅ | Thread communication |
| eventfd/signalfd/timerfd | ✅ Recipe-applied | Async I/O notification |
| SysV sem/shm | ✅ Recipe-activated (2026-04-29) | Qt QSystemSemaphore |
| POSIX msg queues | ❌ Missing | Low priority for desktop |
| SysV msg queues | ❌ Missing | Low priority for desktop |
---
## 3. Critical Gaps and Blockers

### 3.1 Priority Gaps (Blocking Desktop Responsiveness)

| # | Gap | Impact | Blocked Consumer |
|---|-----|--------|-----------------|
| G1 | **No SCHED_RR/SCHED_FIFO** | All threads treated equally; input/audio threads can't get priority | KWin input thread, PulseAudio |
| G2 | **No dynamic priority** | CPU-bound threads aren't penalized; I/O-bound threads aren't boosted | Desktop compositor under load |
| G3 | **No PI futexes** | Priority inversion: low-priority thread holding mutex blocks high-priority waiter | KWin compositor lock, Qt mutexes |
| G4 | **No `pthread_setschedparam`** | Applications can't request scheduling policy changes | All desktop apps |
| G5 | **No timeslice differentiation** | High-priority threads get same quantum as low-priority | Poor latency for foreground tasks |
### 3.2 Scalability Gaps (Blocking Many-Core Performance)

| # | Gap | Impact |
|---|-----|--------|
| G6 | **No work stealing** | CPUs go idle while work exists on other CPUs |
| G7 | **No load balancing** | New threads stay on creator CPU; imbalance builds over time |
| G8 | **Global context switch lock** | Serialization bottleneck beyond ~8 cores |
| G9 | **Global futex mutex** | All cores contend on single L1 lock for futex ops |
| G10 | **O(n) idle wake scan** | Linear scan proportional to total sleeping threads |
| G11 | **No NUMA awareness** | Cross-node memory access penalty on multi-socket systems |
### 3.3 Correctness Gaps (Blocking Robust Applications)

| # | Gap | Impact |
|---|-----|--------|
| G12 | **No robust mutexes** | Thread death while holding mutex → permanent deadlock |
| G13 | **No FUTEX_REQUEUE** | Condvar broadcast wakes all waiters → thundering herd |
| G14 | **No thread groups (tgid)** | `kill(pid, sig)` can't target a process; `getpid()` per thread context |
| G15 | **Static-only sched_affinity** | No userspace CPU pinning API |
| G16 | **No setpriority/getpriority** | POSIX nice values not wired to kernel priority |
| G17 | **pthread barriers hang on SMP** | `check.sh` runs `-smp 1` to work around barrier/once hang on multi-core QEMU — **blocks KWin GPU barrier sync** |
| G18 | **pthread_kill race condition** | All four pthread_kill tests (child/invalid/self/kill0) are failing — thread-targeted signal delivery unreliable |
| G19 | **fork() thread-unsafe** | `pthread_atfork` handlers are empty no-ops; child inherits locked mutexes from parent |
| G20 | **Linux aarch64 rlct_clone stub** | `todo!("rlct_clone not implemented for aarch64 yet")` — **blocks aarch64 builds** |
---

## 4. Implementation Plan

### Phase S1: Scheduler Observability and Metrics (Week 1-2)

**Goal:** Add instrumentation to measure and understand scheduling behavior before optimizing.

#### S1.1 — Per-context scheduling statistics

Add to `Context` struct:

```rust
pub struct Context {
    // NEW scheduling statistics:
    pub sched_run_count: u64,        // Times this context was scheduled
    pub sched_wait_time: u128,       // Total time spent waiting (accumulated)
    pub sched_last_wake: u128,       // Timestamp of last unblock
    pub sched_migrations: u32,       // Times migrated between CPUs
    pub sched_preemptions: u32,      // Times preempted
    pub sched_voluntary_switch: u32, // Times yielded/blocked voluntarily
}
```
**Files:** `context/context.rs` — add fields, initialize in `Context::new()`, update in `switch()`

#### S1.2 — Per-CPU scheduler metrics

Add to `cpu_stats.rs`:

```rust
pub struct CpuStats {
    // Existing: user, nice, kernel, idle, irq
    // NEW:
    pub sched_scans: AtomicU64,           // number of select_next_context calls
    pub sched_empty_scans: AtomicU64,     // scans that found no runnable context
    pub sched_steals: AtomicU64,          // work stolen from other CPUs (future)
    pub sched_ipi_wakeups: AtomicU64,     // wakeups via IPI
    pub sched_max_queue_depth: AtomicU64, // maximum queue depth observed
}
```

#### S1.3 — `/scheme/sys/sched` debug interface

Expose scheduler metrics via a new kernel scheme path:

```
scheme:sys/sched/runqueues     — per-CPU run queue depths
scheme:sys/sched/top           — top-N contexts by recent CPU time
scheme:sys/sched/context/{id}  — per-context scheduling stats
```
This enables `redbear-info` or a new `redbear-sched` tool for runtime diagnostics.

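A minimal sketch of such a consumer, assuming the paths above serve plain text (the exact report format is not specified here; the `/scheme/sys/sched/...` path form follows the section heading):

```rust
use std::fs;

// Hypothetical diagnostic: dump per-CPU run queue depths.
// Assumes /scheme/sys/sched/runqueues serves a plain-text report.
fn main() -> std::io::Result<()> {
    let report = fs::read_to_string("/scheme/sys/sched/runqueues")?;
    print!("{report}");
    Ok(())
}
```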
#### S1.4 — relibc `sched_getscheduler()` baseline

Wire `sched_getscheduler()` to return `SCHED_OTHER` (the current DWRR is closest to SCHED_OTHER):

```rust
// relibc/src/header/sched/mod.rs
pub extern "C" fn sched_getscheduler(_pid: pid_t) -> c_int {
    // For now: all processes use SCHED_OTHER (DWRR)
    SCHED_OTHER
}
```
**Patch:** `local/patches/relibc/P5-sched-observe.patch`

---

### Phase S2: Real-Time Scheduling Support (Week 2-4)

**Goal:** Add `SCHED_FIFO` and `SCHED_RR` scheduling classes to the kernel, and wire relibc `sched_setscheduler()`.

#### S2.1 — Scheduling policy in Context

Add to `Context`:

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SchedPolicy {
    Other,      // DWRR (current default)
    Fifo,       // Strict priority, no preemption within same priority
    RoundRobin, // Strict priority, round-robin within same priority
    // Future:
    // Batch,   // Throughput-optimized, lower priority than Other
    // Idle,    // Only runs when absolutely nothing else is runnable
}

pub struct Context {
    pub sched_policy: SchedPolicy, // NEW
    pub sched_rt_priority: u8,     // NEW: 0-99 RT priority

    // Renamed: prio → sched_dynamic_prio (for SCHED_OTHER)
    pub sched_dynamic_prio: usize,
    pub sched_static_prio: usize,  // NEW: base priority, unmodified by heuristics
}
```

**Initialization:** Default `sched_policy = SchedPolicy::Other`, `sched_rt_priority = 0`.
#### S2.2 — Priority mapping

```
RT priority 99 → kernel prio 0 (highest)
RT priority 98 → kernel prio 1
...
RT priority 0  → kernel prio 39 (lowest RT, still above SCHED_OTHER)

SCHED_OTHER:
nice -20 → kernel prio 0  (still below RT 0)
nice   0 → kernel prio 20 (default)
nice +19 → kernel prio 39

SCHED_FIFO within same RT priority: no preemption (runs until it blocks)
SCHED_RR within same RT priority: round-robin with configurable quantum
```
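A minimal sketch of this mapping as code, under the S2.1 types. The helper name is an assumption, and note that 100 RT levels necessarily compress into 40 kernel slots, so adjacent RT priorities can share a slot:

```rust
// Sketch only: illustrates the table above, not the delivered kernel code.
fn to_kernel_prio(policy: SchedPolicy, value: i32) -> usize {
    match policy {
        // RT 99 maps to kernel prio 0 (highest), RT 0 to kernel prio 39;
        // intermediate values are scaled into the 40-slot range.
        SchedPolicy::Fifo | SchedPolicy::RoundRobin => {
            ((99 - value.clamp(0, 99)) * 39 / 99) as usize
        }
        // SCHED_OTHER: nice -20..+19 maps linearly onto 0..39.
        SchedPolicy::Other => (value.clamp(-20, 19) + 20) as usize,
    }
}
```

RT and SCHED_OTHER share the numeric 0-39 range; it is the policy class, not the number, that guarantees RT always runs first (the dispatch passes in S2.3).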
#### S2.3 — Scheduler dispatch by policy

Modify `select_next_context()` to prioritize:

1. `SCHED_FIFO` contexts (highest RT priority first, no preemption per priority)
2. `SCHED_RR` contexts (highest RT priority first, round-robin per priority)
3. `SCHED_OTHER` contexts (existing DWRR)

```rust
fn select_next_context(...) -> ... {
    // PASS 1: SCHED_FIFO — first runnable at highest priority wins
    for prio in 0..40 {
        if let Some(fifo_ctx) = take_first_runnable_of_policy(
            prio, SchedPolicy::Fifo, &mut contexts_list
        ) {
            return Ok(Some(fifo_ctx));
        }
    }

    // PASS 2: SCHED_RR — round-robin within priority
    for prio in 0..40 {
        if let Some(rr_ctx) = take_next_rr_of_policy(
            prio, &mut contexts_list, &mut percpu.rr_position[prio]
        ) {
            return Ok(Some(rr_ctx));
        }
    }

    // PASS 3: SCHED_OTHER — existing DWRR (unchanged)
    existing_dwrr_logic(...)
}
```

#### S2.4 — SCHED_RR timeslice configuration

Add per-context timeslice for SCHED_RR:

```rust
pub struct Context {
    pub sched_rr_quantum: u128, // nanoseconds, default 100ms
}
```

Override the 3-tick quantum for SCHED_RR contexts: track ticks consumed, preempt at quantum.

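A sketch of that accounting in the timer tick, assuming the S2.4 field above; `rr_consumed` and `schedule()` are placeholders for the kernel's actual per-context state and preemption entry point:

```rust
// Sketch: RR quantum accounting on each PIT tick (not the delivered code).
fn on_timer_tick(ctx: &mut Context, ns_per_tick: u128) {
    if ctx.sched_policy == SchedPolicy::RoundRobin {
        // `rr_consumed` is a hypothetical per-context accumulator.
        ctx.rr_consumed += ns_per_tick;
        if ctx.rr_consumed >= ctx.sched_rr_quantum {
            ctx.rr_consumed = 0;
            // Preempt: the context goes to the tail of its RT priority
            // queue and PASS 2 of S2.3 picks the next one.
            schedule();
        }
    }
}
```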
#### S2.5 — syscall interface for policy changes

Add kernel syscall or extend `proc:` scheme:

```
proc: scheme command: SetSchedPolicy(pid, policy, rt_priority)
```

#### S2.6 — Wire relibc `sched_setscheduler()`

```rust
// relibc/src/header/sched/mod.rs
pub extern "C" fn sched_setscheduler(
    pid: pid_t, policy: c_int, param: *const sched_param,
) -> c_int {
    let prio = unsafe { (*param).sched_priority };
    let kernel_policy = match policy {
        SCHED_FIFO => SchedPolicyRequest::Fifo,
        SCHED_RR => SchedPolicyRequest::RoundRobin,
        SCHED_OTHER => SchedPolicyRequest::Other,
        // set_errno(): shorthand here for "set errno, return -1"
        _ => return set_errno(EINVAL),
    };

    // Send to kernel via proc: scheme
    Sys::set_sched_policy(pid, kernel_policy, prio)
}
```

**Patches:**

- `local/patches/kernel/P5-sched-policy.patch` — Context fields + sched dispatch
- `local/patches/kernel/P5-sched-policy-proc.patch` — proc: scheme SetSchedPolicy
- `local/patches/relibc/P5-sched-setscheduler.patch` — wire through scheme
- `local/patches/relibc/P5-sched-getscheduler.patch` — return current policy
- `local/patches/relibc/P5-sched-priority.patch` — sched_get/setparam
---

### Phase S3: Load Balancing and Work Stealing (Week 4-6)

**Status: ✅ COMPLETE (2026-04-30)** — P3.1 PerCpuSched struct + P3.2 per-CPU wiring + P3.3 work stealing + P3.4 initial placement (least-loaded CPU) + P3.5 periodic load balancing all implemented.

**Goal:** Distribute runnable contexts across CPUs to maximize utilization.

#### S3.1 — Per-CPU run queue lock elimination

Replace the global `RUN_CONTEXTS: Mutex<L1, RunContextData>` with per-CPU run queues:

```rust
// In PercpuBlock:
pub struct PerCpuSched {
    pub run_queues: [VecDeque<WeakContextRef>; 40],
    pub run_queues_lock: SpinLock, // per-CPU, low contention
    pub balance: [usize; 40],
    pub last_queue: usize,
    pub idle_context: Arc<ContextLock>,
}
```

This eliminates the global L1 mutex bottleneck for dequeue operations.

#### S3.2 — Idle CPU work stealing

When `select_next_context()` finds no runnable context on the local CPU:

1. Pick a victim CPU (round-robin or random)
2. Lock the victim's run queues
3. Dequeue the highest-priority runnable context
4. Return it for scheduling

```rust
fn steal_work(percpu: &PercpuBlock, cpu_id: LogicalCpuId) -> Option<ArcContextLockWriteGuard> {
    for victim_offset in 1..cpu_count() {
        let victim_id = (cpu_id + victim_offset) % cpu_count();
        let victim_percpu = percpu_for(victim_id);

        // Try to steal from highest priority queues first
        for prio in 0..40 {
            if let Some(ctx) = victim_percpu.dequeue_runnable(prio) {
                percpu.stats.sched_steals.fetch_add(1, Ordering::Relaxed);
                return Some(ctx);
            }
        }
    }
    None
}
```
#### S3.3 — Initial placement (fork/exec balance)

When creating a new context, instead of always going to the creating CPU's idle queue:

```rust
fn place_new_context(ctx: &mut Context) -> LogicalCpuId {
    // Pick the CPU with the shortest total run queue
    let target = cpus()
        .min_by_key(|cpu| cpu.total_runnable_contexts())
        .unwrap_or(crate::cpu_id());

    ctx.sched_affinity = LogicalCpuSet::single(target);
    target
}
```

#### S3.4 — Periodic load balancing

Add a periodic balancing trigger (e.g., every 100ms or when queue depth difference exceeds threshold):

```rust
fn balance_load() {
    let avg_depth = average_runnable_per_cpu();
    for cpu in overloaded_cpus(avg_depth * 1.25) {
        let target = most_idle_cpu();
        migrate_contexts(cpu, target, cpu.total_runnable() - avg_depth);
    }
}
```

**Patches:**

- `local/patches/kernel/P6-percpu-runqueues.patch` — per-CPU run queues (infrastructure)
- Work stealing, initial placement, and periodic balancing landed with the S3 completion above; the corresponding patches are tracked in the recipes.
---

### Phase S4: Futex Enhancements (Week 6-9)

**Status: ✅ COMPLETE (2026-04-30)** — S4.1 futex sharding (64-shard), S4.2 FUTEX_REQUEUE, S4.3 PI futex, S4.4 robust futex, vruntime tracking, minimum-vruntime selection all implemented.

**Goal:** Add PI, requeue, and per-futex locking to support robust desktop mutex performance.

#### S4.1 — Per-futex locking (reduce global contention)

Replace the single `FUTEXES: Mutex<L1, FutexList>` with a sharded hash table:

```rust
const FUTEX_SHARDS: usize = 64; // or scale with CPU count
static FUTEXES: [Mutex<FutexList>; FUTEX_SHARDS] = ...;

fn futex_shard(phys: PhysicalAddress) -> usize {
    // Shift out the low alignment bits first: futex words are at least
    // 4-byte aligned, so a plain `addr % 64` would leave most shards unused.
    (phys.data() as usize >> 4) % FUTEX_SHARDS
}
```
#### S4.2 — FUTEX_REQUEUE and FUTEX_CMP_REQUEUE

```rust
fn futex_requeue(
    addr1: PhysicalAddress, // source futex
    addr2: PhysicalAddress, // target futex
    val: usize,             // max to requeue
    val2: usize,            // expected value (for CMP_REQUEUE)
    cmp: bool,              // whether to compare first
) -> Result<usize> {
    // Atomically move up to `val` waiters from addr1's wait queue to addr2's
    // If cmp is true, only proceed if *addr1 == val2
}
```

This is critical for condition variable performance — without it, `pthread_cond_broadcast` causes a thundering herd where every waiter wakes, rechecks, and most re-block.

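To make the win concrete, here is how a broadcast could use the operation from userspace. A minimal sketch, assuming the kernel op above is exposed to relibc as `futex_requeue()` with matching parameters and a `futex_wake()` wrapper exists; this is not the delivered relibc code:

```rust
// Sketch: condvar broadcast with requeue instead of wake-all.
// `cond_seq` is the condvar's futex word, `mutex_word` the mutex's.
fn cond_broadcast(cond_seq: &AtomicU32, mutex_word: &AtomicU32) {
    let expected = cond_seq.load(Ordering::Relaxed) as usize;
    // Wake exactly one waiter to grab the mutex now...
    futex_wake(cond_seq, 1);
    // ...and move the remaining waiters onto the mutex futex, so each
    // wakes only when the mutex is handed over, not all at once.
    // CMP semantics guard against a concurrent signal changing the word.
    let _ = futex_requeue(cond_seq, mutex_word, usize::MAX, expected, true);
}
```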
#### S4.3 — PI Futexes (FUTEX_LOCK_PI / FUTEX_UNLOCK_PI / FUTEX_TRYLOCK_PI / FUTEX_CMP_REQUEUE_PI)

Priority inheritance for futexes:

```rust
pub struct PiState {
    owner: Option<Arc<ContextLock>>,
    waiters: Vec<(Arc<ContextLock>, u32)>, // (context, original_priority)
}

// When a high-priority context blocks on a PI futex held by a low-priority context:
fn pi_boost(owner: &mut Context, waiter_prio: usize) {
    // Lower number = higher priority, so "boosting" lowers the value
    if waiter_prio < owner.sched_dynamic_prio {
        owner.sched_dynamic_prio = waiter_prio;
        owner.pi_boosted = true;
    }
}
```

**Critical path:** KWin compositor lock. Without PI, a low-priority background thread holding a mutex that the compositor thread needs can block rendering for an unbounded time.

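The boost must be undone at unlock. A sketch of the counterpart under the same assumptions (`pi_boosted` and `sched_static_prio` from S2.1/S4.3; waiter handoff details elided):

```rust
// Sketch: restore the owner's priority when it releases a PI futex.
fn pi_unboost(owner: &mut Context) {
    if owner.pi_boosted {
        // Drop back to the static base priority. A full implementation
        // must instead recompute the maximum boost across any other PI
        // locks the owner still holds (nested inheritance).
        owner.sched_dynamic_prio = owner.sched_static_prio;
        owner.pi_boosted = false;
    }
    // FUTEX_UNLOCK_PI then hands the futex to the highest-priority waiter.
}
```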
#### S4.4 — Robust Futexes

Mark futex waiters in a `robust_list` so the kernel can unlock them on thread death:

```rust
pub struct RobustListEntry {
    futex_addr: usize,
    futex_len: usize,
    // List is per-thread, registered via set_robust_list syscall
}
```

On `exit_thread()`:

```rust
fn wake_robust_futexes(context: &Context) {
    for entry in &context.robust_list {
        // Set FUTEX_OWNER_DIED bit
        // Wake one waiter with EOWNERDEAD
    }
}
```

**Patches:**

- `local/patches/kernel/P6-futex-sharding.patch` — futex lock sharding (delivered)
- PI futex, requeue, and robust futex patches: initially deferred, since delivered with the S4 completion above and tracked in the recipes
---

### Phase S5: Dynamic Priority and Thread Management (Week 9-11)

**Status: ✅ COMPLETE (2026-04-30)** — S5.1 vruntime + S5.2 setpriority/getpriority + S5.3 pthread_setaffinity_np + S5.4 pthread_setname_np + pthread_setschedparam (Redox) all implemented.

**Goal:** Add I/O-vs-CPU heuristics, CPU affinity API, and thread naming.

#### S5.1 — Dynamic priority adjustment (SCHED_OTHER)

Implement a simplified CFS-style virtual runtime tracking:

```rust
pub struct Context {
    pub vruntime: u128, // Virtual runtime (weighted by priority)
}

// On context switch OUT:
prev_context.vruntime += actual_runtime * SCHED_PRIO_TO_WEIGHT[default_prio]
    / SCHED_PRIO_TO_WEIGHT[prev_context.sched_static_prio];

// On select_next_context for SCHED_OTHER:
// Pick context with lowest vruntime instead of DWRR deficit tracking
```

This automatically penalizes CPU-bound threads (their vruntime grows faster) and favors I/O-bound threads (they sleep, vruntime stays low).

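A worked example of the update rule, with weights from the §2.1 table (illustrative arithmetic only):

```rust
fn main() {
    // SCHED_PRIO_TO_WEIGHT[20] = 1024 (default), SCHED_PRIO_TO_WEIGHT[10] = 9548.
    let default_weight: u128 = 1024;
    let runtime_ns: u128 = 1_000_000; // both threads ran 1 ms of wall time

    let v_default = runtime_ns * default_weight / 1024; // prio 20: +1,000,000 ns
    let v_boosted = runtime_ns * default_weight / 9548; // prio 10: +107,247 ns
    // The higher-weight thread accrues vruntime ~9.3x slower, so
    // min-vruntime selection runs it ~9.3x more often at equal demand.
    println!("default +{v_default} ns, high-prio +{v_boosted} ns");
}
```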
#### S5.2 — POSIX nice values

Map `nice(-20..+19)` to static priorities:

```rust
fn nice_to_static_prio(nice: i8) -> usize {
    // nice -20 → kernel prio 0 (SCHED_OTHER range)
    // nice 0 → kernel prio 20
    // nice +19 → kernel prio 39
    // Clamp before casting to unsigned so an out-of-range nice value
    // cannot wrap around.
    (nice as i32 + 20).clamp(0, 39) as usize
}

// Wire setpriority/getpriority to modify sched_static_prio
```
#### S5.3 — CPU affinity API

Add to `proc:` scheme:

```
proc: scheme command: SetAffinity(pid, affinity_mask: u64)
proc: scheme command: GetAffinity(pid) → u64
```

Wire in relibc:

```rust
pub extern "C" fn pthread_setaffinity_np(
    thread: pthread_t, cpusetsize: size_t, cpuset: *const cpu_set_t,
) -> c_int {
    let mask = unsafe { read_cpu_set(cpuset, cpusetsize) };
    // Resolve the kernel thread ID from the pthread handle (cf. S5.4)
    Sys::set_cpu_affinity(thread.os_tid, mask)
}
```
#### S5.4 — Thread naming API

The kernel `Context.name` field already exists (32-char `ArrayString`). Wire it:

```rust
// proc: scheme command: SetName(pid, name)
// relibc:
pub extern "C" fn pthread_setname_np(thread: pthread_t, name: *const c_char) -> c_int {
    let name = unsafe { CStr::from_ptr(name) };
    Sys::set_thread_name(thread.os_tid, name)
}
```

**Patches:**

- `local/patches/kernel/P6-vruntime-context.patch` — vruntime field + initialization
- `local/patches/kernel/P6-vruntime-switch.patch` — weighted update + min-vruntime selection
- `local/patches/kernel/P7-cache-affine-context.patch` — cache-affine scheduling (last_cpu)
- `local/patches/kernel/P7-cache-affine-switch.patch` — cache-affine vruntime bonus
- `local/patches/kernel/P7-proc-setpriority.patch` — setpriority proc handle
- `local/patches/kernel/P7-proc-setname.patch` — thread naming proc handle
- `local/patches/relibc/P7-setpriority.patch` — setpriority/getpriority
- `local/patches/relibc/P7-pthread-affinity.patch` — pthread_setaffinity_np
- `local/patches/relibc/P7-pthread-setname.patch` — pthread_setname_np
---

### Phase S6: NUMA and Cache-Affine Scheduling (Week 11-13)

**Status: ✅ DELIVERED (2026-04-30)** — S6.3 cache-affine scheduling + S6.1 NUMA topology kernel hints implemented. NUMA discovery (SRAT/SLIT parsing) is userspace responsibility (numad daemon via /scheme/acpi/). Kernel stores lightweight NumaTopology for O(1) scheduling lookups. Full userspace numad daemon is follow-up work.

**Goal:** Optimize for multi-socket systems by keeping related threads near their memory.

#### S6.1 — NUMA topology discovery

Parse the ACPI SRAT/SLIT tables (delegated to the userspace numad daemon via `/scheme/acpi/`, per the Status note above); the kernel keeps only the resulting lightweight hints:
```rust
pub struct NumaTopology {
    nodes: Vec<NumaNode>,
    distances: Vec<Vec<u8>>, // SLIT inter-node distances
}

pub struct NumaNode {
    id: u8,
    cpus: LogicalCpuSet,
    memory: PhysicalMemoryRange,
}
```
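Two lookups the scheduler would need from these hints, as a sketch (the method names and `LogicalCpuSet::contains` are assumptions, not the delivered kernel API):

```rust
impl NumaTopology {
    // Which node owns a CPU? A linear scan is fine: node counts are tiny.
    fn node_of_cpu(&self, cpu: LogicalCpuId) -> Option<u8> {
        self.nodes.iter().find(|n| n.cpus.contains(cpu)).map(|n| n.id)
    }

    // SLIT distance between two nodes (10 = local, larger = farther).
    fn distance(&self, a: u8, b: u8) -> u8 {
        self.distances[a as usize][b as usize]
    }
}
```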
#### S6.2 — NUMA-aware initial placement

When creating a new context:

1. If parent thread has `sched_affinity`, prefer CPUs in the same NUMA node
2. Otherwise, pick the NUMA node with the most free memory

#### S6.3 — Cache-affine scheduling

Track the last CPU a context ran on. Prefer to re-schedule on the same CPU to avoid cache migration penalty:

```rust
pub struct Context {
    pub sched_last_cpu: LogicalCpuId, // already tracked via cpu_id before it becomes None
}
```

In `select_next_context()`:

```rust
// When scanning runnable contexts, prefer those whose last_cpu == current_cpu_id
// (hot cache) over those from other CPUs (cold cache)
let hot_ctx = search_for_hot_context(current_cpu, &queues);
let fallback = search_for_cold_context(&queues);
hot_ctx.or(fallback)
```
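The delivered form, per the `P7-cache-affine-switch` patch name, is a "vruntime bonus" rather than a two-pass scan. A sketch of that idea, with an illustrative bonus constant (not the actual tuned value):

```rust
// Sketch: bias min-vruntime selection toward cache-hot contexts.
const CACHE_HOT_BONUS_NS: u128 = 500_000; // 0.5 ms of virtual time, illustrative

fn effective_vruntime(ctx: &Context, current_cpu: LogicalCpuId) -> u128 {
    if ctx.sched_last_cpu == current_cpu {
        // Saturating: a freshly spawned context may have near-zero vruntime.
        ctx.vruntime.saturating_sub(CACHE_HOT_BONUS_NS)
    } else {
        ctx.vruntime
    }
}
```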
**Patches:**

- `local/patches/kernel/P7-cache-affine-context.patch` — cache-affine scheduling (delivered)
- `local/patches/kernel/P7-cache-affine-switch.patch` — cache-affine vruntime bonus (delivered)
- (kernel-side NUMA SRAT/SLIT parsing dropped: discovery is delegated to the userspace numad daemon, per the Status note)
---

### Phase R1: relibc POSIX Scheduling API Completion (Week 2-4, parallel with S2)

**Goal:** Fill all `todo!()` stubs in `sched.h` and `pthread.h` scheduling functions.

| Function | Implementation |
|----------|---------------|
| `sched_get_priority_max(policy)` | Return 99 for FIFO/RR, 0 for OTHER |
| `sched_get_priority_min(policy)` | Return 1 for FIFO/RR, 0 for OTHER |
| `sched_getparam(pid, param)` | Query kernel for current RT priority |
| `sched_setparam(pid, param)` | Delegate to `sched_setscheduler` with current policy |
| `sched_getscheduler(pid)` | Query kernel for current policy |
| `sched_rr_get_interval(pid, tp)` | Return SCHED_RR quantum (default 100ms) |
| `pthread_setschedparam(thread, policy, param)` | Set kernel sched policy via proc: scheme |
| `pthread_getschedparam(thread, policy, param)` | Get kernel sched policy |
| `pthread_setschedprio(thread, prio)` | Set dynamic priority within current policy |
| `pthread_getcpuclockid(thread, clock_id)` | Return CPU-time clock for thread |

**Patches:** All in `local/patches/relibc/P5-sched-complete.patch`

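The simplest two rows of the table reduce to constant lookups. A sketch with the values above (errno plumbing elided):

```rust
pub extern "C" fn sched_get_priority_max(policy: c_int) -> c_int {
    match policy {
        SCHED_FIFO | SCHED_RR => 99,
        SCHED_OTHER => 0,
        _ => -1, // EINVAL per POSIX; errno handling elided in this sketch
    }
}

pub extern "C" fn sched_get_priority_min(policy: c_int) -> c_int {
    match policy {
        SCHED_FIFO | SCHED_RR => 1,
        SCHED_OTHER => 0,
        _ => -1, // EINVAL per POSIX; errno handling elided in this sketch
    }
}
```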
---

### Phase R2: Robust and PI Mutex Support (Week 5-9, parallel with S4)

**Goal:** Full POSIX mutex robustness and priority inheritance.

#### R2.1 — PI mutex protocol

```rust
// relibc/src/sync/pthread_mutex.rs
pub struct PthreadMutex {
    futex: AtomicU32,
    owner: AtomicUsize, // os_tid of current owner
    pi_waiters: Mutex<Vec<(OsTid, u32)>>, // waiters with requested priority
    flags: AtomicU32, // PTHREAD_PRIO_INHERIT, PTHREAD_MUTEX_ROBUST
}

// Lock with PI:
fn lock_pi(&self) -> Result<(), Errno> {
    loop {
        match futex::lock_pi(&self.futex) {
            Ok(()) => {
                self.owner.store(current_tid(), Ordering::Release);
                return Ok(());
            }
            Err(EAGAIN) => continue,
            Err(err) => return Err(err),
        }
    }
}
```
#### R2.2 — Robust mutex protocol

```rust
pub struct RobustList {
    head: *mut RobustListHead,
}

pub struct RobustListHead {
    list: RobustList,
    futex_offset: isize,
    pending: *mut RobustListHead,
}

// On thread exit:
fn handle_robust_list(thread: &Pthread) {
    for entry in thread.robust_list.iter() {
        unsafe {
            // The futex word lives at a fixed offset from the list node
            let futex_addr =
                (entry as usize).wrapping_add_signed((*entry).futex_offset) as *const AtomicU32;
            // Set FUTEX_OWNER_DIED
            (*futex_addr).fetch_or(FUTEX_OWNER_DIED, Ordering::Release);
            // Wake one waiter with EOWNERDEAD
            futex::wake(futex_addr, 1);
        }
    }
}
```
---

### Phase R3: Thread Groups and Process Identity (Week 10-12)

**Goal:** Proper tgid/pid distinction, `kill(pid, sig)` process targeting.

#### R3.1 — Kernel thread group concept

```rust
pub struct Context {
    pub tgid: usize, // Thread Group ID (= pid for main thread)
    pub tid: usize,  // Thread ID (unique per thread)
}
```
- On `clone(CLONE_THREAD)`: child gets same tgid as parent, new tid (see the sketch after this list)
- On fork: child gets new tgid = child's tid
- `getpid()` returns tgid
- `gettid()` returns tid
- `kill(tgid, sig)` targets the thread group; the signal is handled by one eligible thread (see R3.2)

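A minimal sketch of the ID assignment rules above; `alloc_tid()` and the boolean flag are stand-ins for the actual proc: scheme plumbing:

```rust
// Sketch: tgid/tid assignment at context creation (not delivered code).
fn assign_ids(parent: &Context, child: &mut Context, clone_thread: bool) {
    child.tid = alloc_tid(); // unique for every context
    child.tgid = if clone_thread {
        parent.tgid // new thread joins the parent's thread group
    } else {
        child.tid   // fork: child becomes its own thread group leader
    };
}
```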
#### R3.2 — Thread group signal delivery

```rust
fn deliver_signal_to_thread_group(tgid: usize, sig: Signal) {
    for context in contexts_in_thread_group(tgid) {
        // Pick a thread that hasn't blocked this signal
        if !context.sig_blocked(sig) {
            context.deliver_signal(sig);
            break;
        }
    }
}
```

**Patches:**

- `local/patches/kernel/P5-tgid.patch` — thread group ID kernel support
- `local/patches/kernel/P5-tgid-signal.patch` — process-targeted signal delivery
- `local/patches/relibc/P5-gettid.patch` — gettid() syscall
---

## 5. Dependency Chain

```
Phase S1 (observability)
  │
  ├──► Phase S2 (real-time scheduling) ────┐
  │        │                               │
  │        ├──► Phase R1 (POSIX sched API) │
  │        │                               │
  │        └──► KWin input thread priority │
  │                                        │
  ├──► Phase S3 (load balancing) ──────────┤
  │        │                               │
  │        └──► Mesa worker thread scaling │
  │                                        │
  ├──► Phase S4 (futex enhancements) ──────┤
  │        │                               │
  │        ├──► Phase R2 (PI/robust mutex) │
  │        │                               │
  │        └──► KWin compositor lock       │
  │                                        │
  ├──► Phase S5 (dynamic prio + affinity) ─┤
  │        │                               │
  │        └──► Application CPU pinning    │
  │                                        │
  ├──► Phase R3 (thread groups) ───────────┤
  │        │                               │
  │        └──► process-targeted signals   │
  │                                        │
  └──► Phase S6 (NUMA) ────────────────────┘
           │
           └──► Multi-socket server performance
```
**Independent work (can run in parallel):**

- S2 (RT scheduling) + R1 (POSIX sched API) — parallel
- S4 (futex) + R2 (PI/robust mutex) — parallel
- S3 (load balancing) can start after S1 but independently of S2
- S6 (NUMA) depends on S3 (per-CPU queues) but not on S4/S5
---

## 6. Integration with Existing Plans

| Existing Plan | Relationship |
|---------------|-------------|
| `KERNEL-IPC-CREDENTIAL-PLAN.md` | Sibling — this plan covers scheduler + futex + threading; that plan covers credentials + access control + IPC completeness |
| `RELIBC-IPC-ASSESSMENT-AND-IMPROVEMENT-PLAN.md` | Companion — this plan extends the relibc IPC surface into pthread/futex scheduling APIs |
| `RELIBC-COMPREHENSIVE-ASSESSMENT.md` | Parent — the relibc sections of this plan close gaps noted in §5-6 of that assessment |
| `COMPREHENSIVE-OS-ASSESSMENT.md` | Parent — this plan closes §2 kernel gaps for scheduler/scalability |
| `CONSOLE-TO-KDE-DESKTOP-PLAN.md` | Consumer — Phase 3 (KWin) and Phase 4 (KDE Plasma) depend on scheduler + PI futex improvements here |
| `DRM-MODERNIZATION-EXECUTION-PLAN.md` | Sibling — GPU worker thread scheduling benefits from load balancing (S3) |
| `IRQ-AND-LOWLEVEL-CONTROLLERS-ENHANCEMENT-PLAN.md` | Sibling — IRQ latency affects scheduling latency |
---

## 7. Patch Governance

All kernel and relibc source changes follow the durability policy (`local/AGENTS.md`):

```
local/patches/
├── kernel/
│   (Delivered: P6-* and P7-* patches below. P5-sched-* entries are planned future carriers.)
│   ├── P5-sched-observability.patch   # S1
│   ├── P5-sched-policy.patch          # S2
│   ├── P5-sched-policy-proc.patch     # S2 proc: scheme
│   ├── P6-percpu-runqueues.patch      # S3 (delivered: infrastructure)
│   ├── P6-futex-sharding.patch        # S4 (delivered: sharding)
│   ├── P6-vruntime-context.patch      # S5 (delivered: field + init)
│   ├── P6-vruntime-switch.patch       # S5 (delivered: update + selection)
│   ├── (remaining S3-S6 patches deferred)
├── relibc/
│   ├── P5-sched-observe.patch         # R1 baseline
│   ├── P5-sched-setscheduler.patch    # R1
│   ├── P5-sched-getscheduler.patch    # R1
│   ├── P5-sched-priority.patch        # R1
│   ├── P5-sched-complete.patch        # R1 remaining stubs
│   ├── (PI/robust mutex deferred)     # R2
│   ├── P7-setpriority.patch           # S5 (delivered)
│   ├── P7-pthread-affinity.patch      # S5 (delivered)
│   └── P5-gettid.patch                # R3
```
---

## 8. Validation and Evidence

### 8.1 Build Evidence

| Check | Command |
|-------|---------|
| Kernel compiles | `make r.kernel` |
| relibc compiles | `make r.relibc` |
| Full OS builds | `make all CONFIG_NAME=redbear-full` |
### 8.2 Runtime Evidence

| Test | Verification |
|------|-------------|
| `sched_getscheduler()` returns policy | `redbear-info --sched` |
| `pthread_setschedparam()` changes priority | Threaded test binary: `test-sched-priority` |
| RT thread preempts SCHED_OTHER | Latency test: RT thread wakes within 100μs |
| Work stealing across CPUs | `redbear-info --sched` shows balanced queue depths |
| PI futex prevents priority inversion | PI test: low-prio holder, high-prio waiter, medium-prio contester |
| Robust mutex recovery after thread kill | Robust test: kill thread holding mutex, verify EOWNERDEAD |
| Thread affinity pinning | `taskset`-like test: verify thread stays on assigned CPU |
| Load balancing on fork bomb | Spawn 2× CPUs threads, verify even distribution |
### 8.3 Verification Scripts

```bash
local/scripts/test-sched-qemu.sh         # Scheduler metric validation
local/scripts/test-sched-rt-qemu.sh      # Real-time scheduling proof
local/scripts/test-futex-pi-qemu.sh      # PI futex proof
local/scripts/test-futex-robust-qemu.sh  # Robust futex proof
local/scripts/test-sched-balance-qemu.sh # Load balancing proof (multi-vCPU)
```
---

## 9. Bottom Line

The Redox kernel scheduler is **functional but simple** — a correct DWRR implementation that works for a lightly loaded system. For the KDE/Wayland desktop with dozens of competing threads (compositor, rendering, I/O, timers, D-Bus, input), it needs:

1. **Real-time scheduling** (S2) — for audio and compositor input threads
2. **PI futexes** (S4/R2) — to prevent the compositor lock from being inverted by background work
3. **Load balancing** (S3) — to use all available cores efficiently
4. **Dynamic priority** (S5) — to keep the compositor responsive under CPU load

These four items are the **critical path** to a responsive desktop. The remaining items (NUMA, thread groups, robust mutexes, affinity API) are important for correctness and server-class workloads but not desktop-blocking.

**Total estimated effort:** 13 weeks with 1-2 kernel developers, delivering incremental improvements at each phase boundary.
|