Two-sided fix for the lock-ordering deadlock discovered by
Oracle review (Issue 24):
1. wakeup_contexts (this fn) held IDLE_CONTEXTS while
waiting for SchedQueuesLock on its own CPU via
SchedQueuesLock::new(&percpu.sched). If another CPU's
steal_work was holding that SchedQueuesLock (via a victim
SchedQueuesLock) and waiting for IDLE_CONTEXTS, both
threads spin forever.
Fix: drop idle_contexts immediately after building the
wakeups Vec. The Vec is the only data we need; releasing
the lock here means steal_work on another CPU can proceed
while this CPU acquires its own SchedQueuesLock.
2. steal_work held a victim's SchedQueuesLock (victim_lock)
while calling idle_contexts(token.downgrade()).push_back
on a context that turned out to be Blocked. This is the
matching side of the deadlock: CPU A held IDLE_CONTEXTS and
waited for its own SchedQueuesLock; CPU B (steal_work) held
CPU A's SchedQueuesLock and waited for IDLE_CONTEXTS.
Fix: use idle_contexts_try (try_lock) instead of
idle_contexts (blocking lock). If IDLE_CONTEXTS is busy
(owned by wakeup_contexts on another CPU), skip the
push-back; the context will be re-checked on the next
wakeup round because it was never removed from
IDLE_CONTEXTS (its status was set to Blocked, but its
entry stayed in place).
The original code at line 429 used idle_contexts (blocking)
which is what makes this a real deadlock. try_lock is safe
because:
- If try_lock succeeds, the context is correctly pushed
- If try_lock fails, the context is still in IDLE_CONTEXTS
(we never removed it), so the next wakeup_contexts will
find it again
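The two fixes described above can be illustrated with a minimal sketch in userspace Rust. It assumes std::sync::Mutex in place of the kernel's own lock types, and the Sched struct, its fields, and push_back_blocked are hypothetical stand-ins for IDLE_CONTEXTS, the per-CPU SchedQueuesLock, and the Blocked branch of steal_work; it is not the kernel's actual code.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Hypothetical stand-ins: `idle_contexts` plays the role of IDLE_CONTEXTS,
// `run_queue` the role of the data guarded by a per-CPU SchedQueuesLock.
struct Sched {
    idle_contexts: Mutex<VecDeque<u32>>,
    run_queue: Mutex<Vec<u32>>,
}

// Side 1: copy what is needed out of the idle list, release that lock,
// and only then take the run-queue lock. The two are never held together.
// (Copying instead of removing entries is a simplification for this sketch.)
fn wakeup_contexts(s: &Sched) -> usize {
    let wakeups: Vec<u32> = {
        let idle = s.idle_contexts.lock().unwrap();
        idle.iter().copied().collect()
    }; // the idle-list guard is dropped here
    let mut queue = s.run_queue.lock().unwrap();
    queue.extend(wakeups.iter().copied());
    wakeups.len()
}

// Side 2: while holding the victim's run-queue lock, only *try* the idle
// list. Returns true if the push-back happened, false if it was skipped.
fn push_back_blocked(s: &Sched, ctx: u32) -> bool {
    let _victim = s.run_queue.lock().unwrap();
    match s.idle_contexts.try_lock() {
        Ok(mut idle) => {
            idle.push_back(ctx);
            true
        }
        // Busy (or poisoned): skip rather than wait, so no lock cycle forms.
        Err(_) => false,
    }
}
```

Either change alone breaks the cycle in this sketch: the first removes the hold-and-wait on one side, and the second replaces the blocking wait on the other side with a skip.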
Kernel
Redox OS Microkernel
Requirements
nasm needs to be available on the PATH at build time.
Building The Documentation
Use this command:
cargo doc --open --target x86_64-unknown-none
Debugging
QEMU
Running QEMU with the -s flag will set up QEMU to listen on port 1234 for a GDB client to connect to it. To debug the Redox kernel, run:
make qemu gdb=yes
This will start a virtual machine and listen on port 1234 for a GDB or LLDB client.
GDB
If you are going to use GDB, run these commands to load debug symbols and connect to your running kernel:
(gdb) symbol-file build/kernel.sym
(gdb) target remote localhost:1234
LLDB
If you are going to use LLDB, run these commands to start debugging:
(lldb) target create -s build/kernel.sym build/kernel
(lldb) gdb-remote localhost:1234
After connecting to your kernel you can set some interesting breakpoints and continue
the process. See your debugger's man page for more information on useful commands to run.
Notes
- Always use foo.get(n) instead of foo[n] and try to cover for the possibility of Option::None. Doing it the regular way may work fine for applications, but never in the kernel. No possible panics should ever exist in kernel space, because then the whole OS would just stop working.
- If you receive a kernel panic in QEMU, use pkill qemu-system to kill the frozen QEMU process.
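As a small illustration of the first note, here is a sketch of checked lookups on a slice. The function names slot_value and slot_or_default are hypothetical and chosen only for this example; slice get and Option::copied are standard library methods.

```rust
// Hypothetical helper: look up slot `n` without risking an out-of-bounds
// panic. `slots[n]` would panic when `n >= slots.len()`; `get` returns None.
fn slot_value(slots: &[u32], n: usize) -> Option<u32> {
    slots.get(n).copied()
}

// Hypothetical helper: handle the None case explicitly with a fallback.
fn slot_or_default(slots: &[u32], n: usize, default: u32) -> u32 {
    match slots.get(n) {
        Some(value) => *value,
        None => default,
    }
}
```

A caller can then decide what an out-of-range index means (return an error, use a default, skip the entry) instead of letting the lookup bring down the kernel.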
How To Contribute
To learn how to contribute to this system component you need to read the following document:
Development
To learn how to do development with this system component inside the Redox build system you need to read the Build System and Coding and Building pages.
How To Build
To build this system component you need to download the Redox build system; you can learn how to do it on the Building Redox page.
This is necessary because it only works with cross-compilation to a Redox virtual machine, but you can do some testing from Linux.
Funding - Unix-style Signals and Process Management
This project is funded through NGI Zero Core, a fund established by NLnet with financial support from the European Commission's Next Generation Internet program. Learn more at the NLnet project page.
