docs: add build system improvements post-mortem (10 prioritized proposals)

kellito
2026-06-12 00:49:56 +03:00
parent 7ebffe9c20
commit d6c784ed38
@@ -0,0 +1,269 @@
# Build System Improvements — v6.0 Post-Mortem (2026-06-12)
This document analyzes the build system gaps that surfaced during the v6.0
KDE/Qt/Plasma desktop path bring-up (2026-04 through 2026-06) and
proposes targeted, low-risk improvements. Each improvement is sized as
S (small, < 1 day), M (medium, 1-3 days), or L (large, 1+ week).
## Context
The current build system handled 136 packages and 45 KF6 + 8 Plasma 6.6
cook batches over ~2 days of wall-clock time on the desktop path. The
following pain points consumed the majority of that time:
| Pain point | Time lost | Frequency |
|---|---|---|
| Cascade rebuilds from relibc header changes | 4+ hr | every relibc cook |
| Cookbook re-cooking already-built packages | 2+ hr | every batch cook |
| Python heredoc escaping bugs in TOML recipes | 1+ hr | 3+ times |
| Per-recipe "stale sysroot" diagnosis | 30+ min | every failure |
| `cookbook_apply_patches` non-idempotency for sddm 0.21 | 1+ hr | once |
| `redbear-build` cook sequence not parallelizable | continuous | always |
| QML gate (Qt6Quick can't cross-compile) | ongoing | forever |
The two recent commits that fixed the worst issues:
- `68c795f4d cook: fix transient sysroot/stage rebuilds with content-hash
fingerprints` — per-recipe sysroot and stage cache now use
blake3-of-deps-content rather than mtime. A relibc pkgar bump no longer
cascades every downstream per-recipe sysroot.
- `04c979942 rebuild-cascade: walk [build].dependencies and [build].dev_dependencies`
— rebuild-cascade.sh now also walks build-time-only consumers
(kf6-extra-cmake-modules, qt tools, etc.) that were previously invisible.
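The difference between mtime-based and content-based fingerprints can be sketched as follows. This is an illustration only, not the cookbook's code: `deps_fingerprint` and the file name `relibc.pkgar` are hypothetical, and `hashlib.blake2b` stands in for blake3, which is not in the Python standard library.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def deps_fingerprint(dep_files):
    """Hash dependency archives by name and content, ignoring mtimes.

    Assumption: blake2b is used here as a stand-in for blake3.
    """
    h = hashlib.blake2b(digest_size=32)
    for path in sorted(Path(p) for p in dep_files):
        h.update(path.name.encode())   # which dependency
        h.update(b"\0")
        h.update(path.read_bytes())    # its exact content
    return h.hexdigest()


def demo():
    """Return (same_after_touch, same_after_edit) for a fake pkgar."""
    with tempfile.TemporaryDirectory() as tmp:
        pkgar = Path(tmp) / "relibc.pkgar"    # hypothetical file name
        pkgar.write_bytes(b"relibc build 1")
        before = deps_fingerprint([pkgar])
        os.utime(pkgar, (0, 0))               # mtime changes, content does not
        after_touch = deps_fingerprint([pkgar])
        pkgar.write_bytes(b"relibc build 2")  # content really changes
        after_edit = deps_fingerprint([pkgar])
        return before == after_touch, before == after_edit
```

A re-pushed but byte-identical pkgar keeps the same fingerprint, so downstream sysroots are left alone; only a real content change invalidates them.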
## Proposed improvements (priority order)
### 1. Parallel-safe cook pool (M, ~2 days)
**Problem.** `repo cook A B C D` runs strictly serially. A KF6 batch of
15 cooks takes ~2 hours of wall-clock time, and the cookbook has no
parallel-cook mode.
**Proposal.** Add `repo cook --jobs=N` that runs N independent cookbook
invocations in parallel, each writing to its own `target/<arch>/build/`
and `target/<arch>/stage.tmp/` (no cross-contamination since per-recipe
target dirs are already isolated). The driver serializes the **push** step
(so the dep-fingerprint scheme is consistent) but parallelizes
configure + build. Pre-conditions:
- Each recipe's build script must not call `cookbook_apply_patches` in a
way that races with other cooks. (Current patches are per-recipe so OK.)
- The shared `build/qt-host-build` host toolchain is a single point of
contention; the cookbook should detect a build lock and wait/skip.
**Expected gain.** 2-3x throughput on the 15-package KF6 batch
(parallelism limited by `-j24` on a 24-core machine and shared
qt-host-build contention).
**Risk.** Medium — could expose races in the cookbook's stage.tmp
handling. Pilot on a 4-package batch first.
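The "parallel build, serialized push" shape of the proposal might look roughly like this. It is a sketch under assumptions: `cook_parallel`, `build`, and `push` are hypothetical names, the two callables stand in for the cookbook's configure + build and push steps, and the packages are assumed to be independent of each other (no dependency ordering is done here).

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def cook_parallel(packages, build, push, jobs=4):
    """Run build(pkg) concurrently, but push(pkg) one at a time.

    `build` and `push` are caller-supplied callables (hypothetical
    stand-ins for the cookbook's real steps).
    """
    push_lock = threading.Lock()

    def cook_one(pkg):
        build(pkg)            # configure + build: allowed to overlap
        with push_lock:       # push: serialized across all workers
            push(pkg)
        return pkg

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # map() returns results in input order and re-raises the first
        # exception, so one failed cook fails the whole batch.
        return list(pool.map(cook_one, packages))
```

A real driver would launch cookbook subprocesses rather than Python callables, and would also need the wait/skip handling for the shared `build/qt-host-build` lock described above.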
### 2. `cook --repair` mode (S, ~0.5 day)
**Problem.** When a cook fails mid-build, the user's only options are
`repo cook <pkg>` (which often re-runs the configure step from scratch)
or `rm -rf target/<arch>/build target/<arch>/stage.tmp` (which
re-pushes deps). Both are slow.
**Proposal.** Add `repo cook --repair <pkg>` that:
1. Keeps the existing source dir + sysroot
2. Re-runs the cookbook's build script with the existing `build/` dir
3. Skips the configure step if `CMakeCache.txt` is newer than the
source dir
4. Only re-pushes the pkgar if the build artifact changed (use
`.deps-fingerprint` to gate the push)
**Expected gain.** Cut per-failure recovery from 5-20 minutes to
30-60 seconds. Critical when iterating on a single recipe.
**Risk.** Low — purely additive. Falls back to full cook on any error.
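Steps 3 and 4 above reduce to two small checks, sketched here. The helper names are hypothetical, and keeping `.deps-fingerprint` inside the build dir is an assumption made for the example.

```python
import os
import tempfile
from pathlib import Path


def newest_mtime(root):
    """Newest modification time of any file under root (0.0 if none)."""
    times = [p.stat().st_mtime for p in Path(root).rglob("*") if p.is_file()]
    return max(times, default=0.0)


def needs_configure(source_dir, build_dir):
    """True unless CMakeCache.txt exists and is newer than all sources."""
    cache = Path(build_dir) / "CMakeCache.txt"
    if not cache.is_file():
        return True
    return cache.stat().st_mtime <= newest_mtime(source_dir)


def needs_push(build_dir, new_fingerprint):
    """True when the stored fingerprint differs from the new one."""
    stamp = Path(build_dir) / ".deps-fingerprint"   # assumed location
    old = stamp.read_text().strip() if stamp.is_file() else None
    return old != new_fingerprint


def demo():
    """Exercise both checks on a throwaway tree; returns the decisions."""
    with tempfile.TemporaryDirectory() as tmp:
        src, bld = Path(tmp, "source"), Path(tmp, "build")
        src.mkdir()
        bld.mkdir()
        cmakelists = src / "CMakeLists.txt"
        cmakelists.write_text("project(demo)\n")
        os.utime(cmakelists, (1000, 1000))
        no_cache = needs_configure(src, bld)      # no cache yet
        cache = bld / "CMakeCache.txt"
        cache.write_text("")
        os.utime(cache, (2000, 2000))
        fresh = needs_configure(src, bld)         # cache newer than sources
        os.utime(cmakelists, (3000, 3000))
        stale = needs_configure(src, bld)         # source edited afterwards
        (bld / ".deps-fingerprint").write_text("abc\n")
        return (no_cache, fresh, stale,
                needs_push(bld, "abc"), needs_push(bld, "def"))
```

Comparing mtimes is cheap but coarse. A stricter variant could hash the source tree instead, at the cost of a slower check.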
### 3. Per-recipe patch idempotency auditor (S, ~0.5 day)
**Problem.** External patches in `local/patches/<component>/*.patch`
that neither apply cleanly nor reverse cleanly cause the cookbook to
fail with confusing errors (we hit this 4+ times with sddm 0.21.0). The
`cookbook_apply_patches` helper uses `git apply --reverse --check`, but
that check fails for any multi-hunk patch where some hunks are already
in the "to" state and others aren't.
**Proposal.** Add a `validate-patches.sh` script that runs `git apply
--reverse --check` against every patch in `local/patches/`, plus a
round-trip on a pristine checkout (`git apply --check`, apply, then
`git apply --reverse --check`) to verify both directions work. Add a CI
hook (or a `make lint` target) that runs this.
**Expected gain.** Catch patch issues at lint time, not in a 2-hour
cook. The sddm 0.21.0 patch was 8+ hours of debugging.
**Risk.** None.
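One way to turn the two `git apply` checks into a clear verdict is sketched below. The state names and the `patch_state` helper are hypothetical; only `git apply --check` and `--reverse` are real git options.

```python
import subprocess
from pathlib import Path


def git_apply_check(patch, source_dir, reverse=False):
    """Return True if `git apply --check` accepts the patch in source_dir."""
    cmd = ["git", "apply", "--check"]
    if reverse:
        cmd.append("--reverse")
    cmd.append(str(Path(patch).resolve()))   # absolute: cwd changes below
    done = subprocess.run(cmd, cwd=source_dir, capture_output=True)
    return done.returncode == 0


def classify(forward_ok, reverse_ok):
    """Map the two check results to a patch state (names are illustrative)."""
    if forward_ok and not reverse_ok:
        return "unapplied"           # clean tree, patch applies
    if reverse_ok and not forward_ok:
        return "applied"             # patch already fully applied
    if forward_ok and reverse_ok:
        return "ambiguous"           # applies both ways, e.g. an empty patch
    return "partial-or-stale"        # some hunks applied, or upstream drifted


def patch_state(patch, source_dir, check=git_apply_check):
    """Classify one patch; `check` is injectable so it can be tested."""
    return classify(check(patch, source_dir),
                    check(patch, source_dir, reverse=True))
```

The `partial-or-stale` state is the one that caused the confusing sddm failures: neither direction checks cleanly, so the auditor can flag it by name instead of letting the cook fail later.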
### 4. Cookbook-cached `repo cook` TUI status (M, ~1 day)
**Problem.** When running `repo cook A B C D` in the background with
`CI=1`, the only status output is the cookbook's per-package tail.
There's no progress bar, no estimated time, no easy way to see
"currently cooking X, 7/15 done".
**Proposal.** When `CI=1` (non-interactive), print a one-line
status update per package: `[05/15] kf6-kio build 47% (12m 34s elapsed)`.
Parse ninja's `[X/Y]` progress lines for build progress (ninja writes
them to stdout, not stderr). Flush stdout after each status line.
**Expected gain.** Better UX for long cooks. Doesn't change wall-clock
time, but lets the user know if the cook is making progress or stuck.
**Risk.** None.
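The status line from the proposal can be produced with a small parser and formatter. This is a sketch that assumes ninja's default `[X/Y]` status prefix; the function names are hypothetical.

```python
import re

# Matches ninja's default status prefix, e.g. "[47/100] Building CXX ...".
NINJA_PROGRESS = re.compile(r"^\[(\d+)/(\d+)\]")


def parse_progress(line):
    """Return the build percentage from a ninja `[X/Y]` line, else None."""
    m = NINJA_PROGRESS.match(line)
    if not m or int(m.group(2)) == 0:
        return None
    return 100 * int(m.group(1)) // int(m.group(2))


def status_line(index, total, pkg, percent, elapsed_s):
    """Format one status line for package `index` of `total`."""
    minutes, seconds = divmod(int(elapsed_s), 60)
    width = len(str(total))   # zero-pad the index to the width of the total
    return (f"[{index:0{width}d}/{total}] {pkg} build {percent}% "
            f"({minutes}m {seconds:02d}s elapsed)")


if __name__ == "__main__":
    pct = parse_progress("[47/100] Building CXX object foo.o")
    print(status_line(5, 15, "kf6-kio", pct, 754), flush=True)
```

Recipes that override `NINJA_STATUS` or build with make would not match this pattern; for those the driver could fall back to printing elapsed time only.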
### 5. Build-time recipe lint in `make lint` (M, ~1 day)
**Problem.** Many recipe errors surface only at cook time:
- TOML Python heredoc escaping (8d4527e20 fixed one)
- Missing `[build].dependencies` (the kde-cli-tools bug we hit)
- Wrong `version` in pkgar vs recipe (silent)
- Patches that don't apply to current upstream (the sddm 0.21 issue)
**Proposal.** Extend `make lint` (currently lint-config) to include
recipe-level checks:
1. For every recipe, parse `recipe.toml` and verify `[build].dependencies`
lists every `[package].dependencies` member. (A mismatch between the
two lists is currently a common bug.)
2. For every recipe with `[source].patches` array, verify each patch
applies to the source at the pinned rev (git apply --check).
3. For every recipe, verify the resulting `.pkgar` is in `repo/` with
matching `version =` in the toml.
4. For every recipe with `[build].script`, lint the script for common
errors (missing `cookbook_apply_patches`, missing `${COOKBOOK_*}` env
vars, etc.).
**Expected gain.** Catch issues at `make lint` time, not 2 hours into
a cook. The kde-cli-tools missing-dep bug alone cost 30+ minutes.
**Risk.** None. Lint is a separate step.
### 6. `recipes/kf6-*` recipe dep audit (S, ~0.5 day)
**Problem.** The 45 KF6 recipes have grown over time and their
`[build].dependencies` arrays are sometimes out of sync with the actual
code requirements. Examples from this session:
- kde-cli-tools needed `kf6-kcmutils` and `kf6-kparts` (added by us)
- kf6-kio had a circular reference risk via `kf6-kparts`
- kf6-syntaxhighlighting had a host-toolchain Python env escaping bug
**Proposal.** Run a one-time `audit-recipe-deps.sh` that, for each KF6
recipe, downloads the source, parses the CMakeLists.txt + *.cmake
files, extracts `find_package(KF6 ... COMPONENTS ...)` and
`find_package(KF6<Name> ...)` calls, and verifies every component is in
`[build].dependencies`. Report any mismatches as warnings.
**Expected gain.** Prevents future "missing dep" failures. No runtime
impact.
**Risk.** None.
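The extraction step could be approximated with a regular expression, as in this sketch. It is not a CMake parser: it ignores comments and variables that expand to component lists. The component-to-recipe mapping passed as `recipe_for` is hypothetical and would have to be maintained by hand.

```python
import re

# Matches find_package(KF6 ...) and find_package(KF6<Name> ...).
FIND_PACKAGE = re.compile(r"(?i:find_package)\s*\(\s*KF6(\w*)([^)]*)\)")
KEYWORDS = {"REQUIRED", "COMPONENTS", "OPTIONAL_COMPONENTS", "QUIET",
            "CONFIG", "NO_MODULE", "EXACT"}


def kf6_components(cmake_text):
    """Collect KF6 component names referenced by find_package calls."""
    found = set()
    for name, args in FIND_PACKAGE.findall(cmake_text):
        if name:                        # find_package(KF6CoreAddons ...)
            found.add(name)
            continue
        for token in args.split():      # find_package(KF6 ... COMPONENTS ...)
            if token.upper() in KEYWORDS or token.startswith("$"):
                continue
            if re.fullmatch(r"[\d.]+", token):   # version number
                continue
            found.add(token)
    return found


def missing_recipes(cmake_text, declared, recipe_for):
    """Return recipe names required by CMake but not declared.

    `recipe_for` maps component name -> recipe name (an assumption;
    unmapped components are reported under their CMake name).
    """
    needed = {recipe_for.get(c, c) for c in kf6_components(cmake_text)}
    return sorted(needed - set(declared))
```

Reporting mismatches as warnings, as proposed, matters here because optional components found this way are not always real build requirements.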
### 7. QML gate — make Qt6Quick host-targetable (L, ~2 weeks)
**Problem.** Qt6Quick/QML cross-compilation is broken on Redox. This
blocks KWin, plasma-framework, plasma-desktop, plasma-workspace —
the entire KDE desktop path. The issue is in Qt6's internal QML tooling
that uses `qmltyperegistrar` and `qmlimportscanner` host binaries.
**Proposal.** Two-track approach:
A. **Short term (S).** Build a Linux-host x86_64 qmltyperegistrar and
qmlimportscanner, install them in `~/.redoxer/x86_64-unknown-redox/toolchain/bin/`,
and add to the toolchain. The KF6 recipes' cmake already supports
`QT_HOST_PATH` for this purpose.
B. **Long term (L).** Add a Redox-host qmltyperegistrar implementation.
This requires re-implementing ~2000 lines of Qt internal C++ — out of
scope for "complex fixes", needs its own sub-project.
**Expected gain.** Track A unblocks the entire KDE desktop path. Track B
is a long-term maintainability win.
**Risk.** Track A is low risk (it's how upstream Redox already handles
it). Track B is high risk (substantial new code).
### 8. `redbear_qt_link_sysroot_dirs` should be a no-op when not needed (S, ~0.25 day)
**Problem.** Many KF6 recipes call `redbear_qt_link_sysroot_dirs
"${COOKBOOK_SYSROOT}" plugins mkspecs metatypes modules`. This is
needed for qtbase's CMake configs to find the right paths. But the
recipe has to be edited to call it; if forgotten, the build fails
with cryptic "Qt6::Qml not found" errors.
**Proposal.** Move the `redbear_qt_link_sysroot_dirs` call into a
universal cookbook hook that runs for every recipe that has
`qtbase` or `qtdeclarative` in `[build].dependencies`. The hook
auto-detects qt deps and applies the symlinks.
**Expected gain.** Removes a common footgun. New KF6 recipes just work.
**Risk.** Low — purely additive.
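The auto-detection amounts to a set intersection on the recipe's build dependencies. In this sketch `qt_link_command` is a hypothetical helper that returns the shell line the hook would run; the trigger packages and directory list are taken from the text above.

```python
QT_TRIGGER_DEPS = {"qtbase", "qtdeclarative"}
LINK_DIRS = ["plugins", "mkspecs", "metatypes", "modules"]


def qt_link_command(build_deps, sysroot="${COOKBOOK_SYSROOT}"):
    """Return the shell line the hook would run, or None without a Qt dep."""
    if QT_TRIGGER_DEPS.isdisjoint(build_deps):
        return None                      # no Qt dependency: hook is a no-op
    return f'redbear_qt_link_sysroot_dirs "{sysroot}" ' + " ".join(LINK_DIRS)
```

A recipe that reaches Qt only through a KF6 dependency would not be caught by this direct check. Whether the hook should walk transitive dependencies is an open design question.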
### 9. Cookbook build-failure classifier (M, ~1 day)
**Problem.** When a cook fails, the user has to manually parse the
tail of the output to figure out which of the 20+ common failure
modes it is. We hit at least 8 distinct failure modes this session:
- GLESv2 / Qt6Gui visibility
- Python3 development headers missing
- LibMount missing
- relibc `<search.h>` not found
- C++20 std::ranges not declared
- C++ qfloat16 (__extendhfdf2) missing
- Stale sysroot (KF6CoreAddons 6.10 vs 6.26)
- gettext gnulib rebuild loop
**Proposal.** Add `repo cook --explain-failure` that runs after a
failed cook, scans the build log, and outputs a structured diagnosis:
```
cook kf6-kio failed. Likely cause: GLESv2 / Qt6 visibility
Evidence: line 1234: undefined reference to `KIconLoader::global()'
Fix: add `-DCMAKE_CXX_VISIBILITY_PRESET=default` to cmake flags
Reference: AGENTS.md §"COMPLEX FIX CHECKLIST (v6.0-impl17)" entry 10
```
**Expected gain.** Cut per-failure diagnosis from 5-10 minutes to
10-30 seconds. Critical for new contributors.
**Risk.** None — read-only analysis.
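A first version of the classifier could be a table of regular expressions scanned against the log, as sketched here. The three rules and their patterns are illustrative guesses at typical log text, not patterns taken from real failures, and `explain_failure` is a hypothetical function name.

```python
import re
from dataclasses import dataclass


@dataclass
class Diagnosis:
    cause: str
    line_no: int
    evidence: str


# Illustrative rules only; each pattern is an assumption about log wording.
RULES = [
    ("relibc <search.h> not found", re.compile(r"search\.h.*No such file")),
    ("C++ qfloat16 (__extendhfdf2) missing", re.compile(r"__extendhfdf2")),
    ("LibMount missing", re.compile(r"Could NOT find LibMount")),
]


def explain_failure(log_text):
    """Return the first matching Diagnosis in the log, or None."""
    for line_no, line in enumerate(log_text.splitlines(), start=1):
        for cause, pattern in RULES:
            if pattern.search(line):
                return Diagnosis(cause, line_no, line.strip())
    return None
```

Returning the first match keeps the output short. Reporting the earliest error is usually the most useful, since later errors in a build log are often consequences of the first.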
### 10. Cookbook scratch-build system (L, ~1 week)
**Problem.** When something goes deeply wrong (e.g. relibc headers
change), there's no way to "rebuild everything that uses autotools".
`build-redbear.sh` has stale detection, but it only triggers on
relibc/kernel/base source commits, not on dep pkgar changes.
**Proposal.** Add `make scratch-rebuild` that:
1. Identifies all packages using autotools (pcre2, gettext, libiconv, etc.)
2. For each, deletes `target/<arch>/build` and `target/<arch>/sysroot`
3. Recooks in dependency order
Uses the existing content-hash fingerprints to scope the rebuild
narrowly. Most useful after a toolchain or relibc change.
**Expected gain.** Predictable, narrow rebuild after low-level changes.
Eliminates the "delete and pray" pattern.
**Risk.** Medium — needs to be tested against real cascades.
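Step 3, recooking in dependency order, is a topological sort over the affected packages. The sketch below uses the standard-library `graphlib` (Python 3.9+); `rebuild_order` is a hypothetical name, and the dependency map a caller passes in would come from the recipes.

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9


def rebuild_order(deps, affected):
    """Order affected packages so each dependency precedes its users.

    `deps` maps package -> list of packages it depends on. Dependencies
    outside `affected` are ignored because they are not being rebuilt.
    """
    affected = set(affected)
    graph = {pkg: [d for d in deps.get(pkg, []) if d in affected]
             for pkg in sorted(affected)}
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())
```

A `CycleError` here would also surface circular references such as the `kf6-kio` / `kf6-kparts` risk noted in proposal 6 before any cook starts.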
## Summary
| # | Title | Size | Gain | Risk |
|---|---|---|---|---|
| 1 | Parallel-safe cook pool | M | 2-3x | Medium |
| 2 | `cook --repair` mode | S | 5-10x per-failure | Low |
| 3 | Per-recipe patch idempotency auditor | S | Catch at lint | None |
| 4 | `repo cook` TUI status | M | UX | None |
| 5 | Build-time recipe lint | M | Catch at lint | None |
| 6 | KF6 recipe dep audit | S | Prevent bugs | None |
| 7 | QML gate | L | Unblock KDE | A: Low, B: High |
| 8 | Auto-link Qt sysroot dirs | S | Fewer bugs | Low |
| 9 | Failure classifier | M | 5-10x diagnosis | None |
| 10 | Scratch-rebuild system | L | Predictable | Medium |
Recommended order: #3, #6, #8 (S-sized, low risk, quick wins), then #2,
#5, #9 (S- to M-sized, real productivity wins), then #4, #7A, #10, #1
(bigger), then #7B as a separate project.