docs: add build system improvements post-mortem (10 prioritized proposals)

2026-06-12 00:49:56 +03:00
parent 7ebffe9c20
commit d6c784ed38
1 changed files with 269 additions and 0 deletions
@@ -0,0 +1,269 @@
+# Build System Improvements — v6.0 Post-Mortem (2026-06-12)
+
+This document analyzes the build system gaps that surfaced during the v6.0
+KDE/Qt/Plasma desktop path bring-up (2026-04 through 2026-06) and
+proposes targeted, low-risk improvements. Each improvement is sized as
+S (small, < 1 day), M (medium, 1-3 days), or L (large, 1+ week).
+
+## Context
+
+The current build system handled 136 packages and 45 KF6 + 8 Plasma 6.6
+cook batches over ~2 days of wall-clock time on the desktop path. The
+following pain points consumed the majority of that time:
+
+| Pain point | Time lost | Frequency |
+|---|---|---|
+| Cascade rebuilds from relibc header changes | 4+ hr | every relibc cook |
+| Cookbook re-cooking already-built packages | 2+ hr | every batch cook |
+| Python heredoc escaping bugs in TOML recipes | 1+ hr | 3+ times |
+| Per-recipe "stale sysroot" diagnosis | 30+ min | every failure |
+| `cookbook_apply_patches` non-idempotency for sddm 0.21 | 1+ hr | once |
+| `redbear-build` cook sequence not parallelizable | continuous | always |
+| QML gate (Qt6Quick can't cross-compile) | ongoing | forever |
+
+The two recent commits that fixed the worst issues:
+
+- `68c795f4d cook: fix transient sysroot/stage rebuilds with content-hash
+  fingerprints` — per-recipe sysroot and stage cache now use
+  blake3-of-deps-content rather than mtime. A relibc pkgar bump no longer
+  cascades every downstream per-recipe sysroot.
+- `04c979942 rebuild-cascade: walk [build].dependencies and [build].dev_dependencies`
+  — rebuild-cascade.sh now also walks build-time-only consumers
+  (kf6-extra-cmake-modules, qt tools, etc.) that were previously invisible.
+
+## Proposed improvements (priority order)
+
+### 1. Parallel-safe cook pool (M, ~2 days)
+
+**Problem.** `cook A B C D` runs strictly serially. A KF6 batch of 15 cooks
+takes ~2 hours wall-clock. The cookbook has no parallel-cook mode.
+
+**Proposal.** Add `repo cook --jobs=N` that runs N independent cookbook
+invocations in parallel, each writing to its own `target/<arch>/build/`
+and `target/<arch>/stage.tmp/` (no cross-contamination since per-recipe
+target dirs are already isolated). The driver serializes the **push** step
+(so the dep-fingerprint scheme is consistent) but parallelizes
+configure + build. Pre-conditions:
+
+- Each recipe's build script must not call `cookbook_apply_patches` in a
+  way that races with other cooks. (Current patches are per-recipe so OK.)
+- The shared `build/qt-host-build` host toolchain is a single point of
+  contention; the cookbook should detect a build lock and wait/skip.
+
+**Expected gain.** 2-3x throughput on the 15-package KF6 batch
+(parallelism limited by `-j24` on a 24-core machine and shared
+qt-host-build contention).
+
+**Risk.** Medium — could expose races in the cookbook's stage.tmp
+handling. Pilot on a 4-package batch first.
+
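A minimal sketch of the driver shape described above: builds run concurrently, and the push step is serialized behind a single lock. The `build` and `push` callables are hypothetical stand-ins for the real cookbook steps, and the sketch assumes the packages passed in do not depend on each other.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def cook_all(packages, build, push, jobs=4):
    """Build packages in parallel; serialize the push step behind one lock.

    `build(pkg)` returns an artifact and `push(pkg, artifact)` publishes it.
    Both are placeholders for the real cookbook steps (assumption).
    """
    push_lock = threading.Lock()

    def cook_one(pkg):
        artifact = build(pkg)      # configure + build, runs concurrently
        with push_lock:            # at most one push at a time
            push(pkg, artifact)
        return pkg

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # pool.map yields results in input order, whatever the finish order
        return list(pool.map(cook_one, packages))
```

A real driver would also need to hold back packages whose dependencies are still cooking and take a lock around the shared `build/qt-host-build` tree. Neither is shown here.
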
+### 2. `cook --repair` mode (S, ~0.5 day)
+
+**Problem.** When a cook fails mid-build, the user's only options are
+`repo cook <pkg>` (which often re-runs the configure step from scratch)
+or `rm -rf target/<arch>/build target/<arch>/stage.tmp` (which
+re-pushes deps). Both are slow.
+
+**Proposal.** Add `repo cook --repair <pkg>` that:
+1. Keeps the existing source dir + sysroot
+2. Re-runs the cookbook's build script with the existing `build/` dir
+3. Skips the configure step if `CMakeCache.txt` is newer than the
+   source dir
+4. Only re-pushes the pkgar if the build artifact changed (use
+   `.deps-fingerprint` to gate the push)
+
+**Expected gain.** Cut per-failure recovery from 5-20 minutes to
+30-60 seconds. Critical when iterating on a single recipe.
+
+**Risk.** Low — purely additive. Falls back to full cook on any error.
+
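The two gating decisions in steps 3 and 4 could look roughly like the sketch below. It is illustrative only: SHA-256 from the standard library stands in for the blake3 fingerprints the cookbook uses, and the fingerprint file passed to `needs_push` is a hypothetical per-artifact file, not the existing `.deps-fingerprint` format.

```python
import hashlib
from pathlib import Path


def newest_mtime(root: Path) -> float:
    """Latest modification time of any file under root (0.0 if there are none)."""
    return max((p.stat().st_mtime for p in root.rglob("*") if p.is_file()),
               default=0.0)


def needs_configure(build_dir: Path, source_dir: Path) -> bool:
    """Re-run configure unless CMakeCache.txt is newer than every source file."""
    cache = build_dir / "CMakeCache.txt"
    if not cache.exists():
        return True
    return cache.stat().st_mtime <= newest_mtime(source_dir)


def needs_push(artifact: Path, fingerprint_file: Path) -> bool:
    """Push only when the artifact's content hash differs from the recorded one."""
    digest = hashlib.sha256(artifact.read_bytes()).hexdigest()
    if fingerprint_file.exists() and fingerprint_file.read_text().strip() == digest:
        return False
    fingerprint_file.write_text(digest + "\n")  # remember what was pushed
    return True
```

Comparing against the newest file in the source tree rather than the directory's own mtime is a deliberate choice here, since a directory mtime does not change when a file inside it is edited in place.
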
+### 3. Per-recipe patch idempotency auditor (S, ~0.5 day)
+
+**Problem.** External patches in `local/patches/<component>/*.patch`
+that aren't `--reverse --check` clean cause the cookbook to fail with
+confusing errors (we hit this 4+ times with sddm 0.21.0). The
+`cookbook_apply_patches` helper uses `git apply --reverse --check` to
+detect already-applied patches, but that check fails for any partially
+applied patch, i.e. one with multiple hunks where some are already in
+the "to" state and others aren't.
+
+**Proposal.** Add a `validate-patches.sh` script that runs `git apply
+--reverse --check` against every patch in `local/patches/`, plus a
+round-trip (`git apply --check`, then `git apply --reverse --check` on the
+patched tree) to verify both directions work. Add a CI hook (or a
+`make lint` target) that runs this.
+
+**Expected gain.** Catch patch issues at lint time, not in a 2-hour
+cook. The sddm 0.21.0 patch was 8+ hours of debugging.
+
+**Risk.** None.
+
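One way to express the per-patch check is to classify each patch by whether the forward and reverse checks pass. The sketch below shells out to `git apply --check` and `git apply --reverse --check`; the function names and the three state labels are made up for illustration, and the `run` parameter exists only so the logic can be exercised without a real checkout.

```python
import subprocess


def git_apply_ok(repo, patch, reverse=False, run=subprocess.run):
    """True if `git apply --check` (optionally with --reverse) accepts the patch.

    `patch` should be an absolute path, because `git -C` changes directory first.
    """
    cmd = ["git", "-C", str(repo), "apply", "--check"]
    if reverse:
        cmd.append("--reverse")
    cmd.append(str(patch))
    return run(cmd, capture_output=True).returncode == 0


def patch_state(repo, patch, run=subprocess.run):
    """Classify a patch as 'unapplied', 'applied', or 'inconsistent'."""
    forward = git_apply_ok(repo, patch, run=run)
    reverse = git_apply_ok(repo, patch, reverse=True, run=run)
    if forward and not reverse:
        return "unapplied"       # clean tree, patch can go on
    if reverse and not forward:
        return "applied"         # patch is fully present, safe to skip
    return "inconsistent"        # partially applied or stale against upstream
```

The `inconsistent` case is the one that corresponds to the partially applied sddm patch described above, where neither direction checks cleanly.
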
+### 4. Cookbook-cached `repo cook` TUI status (M, ~1 day)
+
+**Problem.** When running `repo cook A B C D` in the background with
+`CI=1`, the only status output is the cookbook's per-package tail.
+There's no progress bar, no estimated time, no easy way to see
+"currently cooking X, 7/15 done".
+
+**Proposal.** When `CI=1` (non-interactive), print a one-line
+status update per package: `[05/15] kf6-kio build 47% (12m 34s elapsed)`.
+Parse ninja's `[X/Y]` progress lines from the build output to compute the
+percentage. Print to stdout, flushing after each line.
+
+**Expected gain.** Better UX for long cooks. Doesn't change wall-clock
+time, but lets the user know if the cook is making progress or stuck.
+
+**Risk.** None.
+
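A small sketch of the parsing and formatting involved, assuming ninja's default `[X/Y] ...` progress prefix. The function names are hypothetical, and the status format simply reproduces the example line from the proposal.

```python
import re

# Matches ninja's default progress prefix, e.g. "[47/100] Building CXX ..."
NINJA_PROGRESS = re.compile(r"^\[(\d+)/(\d+)\]")


def parse_progress(line):
    """Return build progress as an integer percentage, or None if not a progress line."""
    m = NINJA_PROGRESS.match(line)
    if not m or int(m.group(2)) == 0:
        return None
    return 100 * int(m.group(1)) // int(m.group(2))


def format_status(index, total, package, percent, elapsed_s):
    """Format one status line: package position, build percentage, elapsed time."""
    minutes, seconds = divmod(int(elapsed_s), 60)
    width = len(str(total))  # zero-pad the index to the width of the total
    return (f"[{index:0{width}d}/{total}] {package} build {percent}% "
            f"({minutes}m {seconds:02d}s elapsed)")
```

A caller would emit each line with `print(..., flush=True)` so that the status shows up promptly when stdout is a pipe or a log file.
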
+### 5. Build-time recipe lint in `make lint` (M, ~1 day)
+
+**Problem.** Many recipe errors surface only at cook time:
+- TOML Python heredoc escaping (8d4527e20 fixed one)
+- Missing `[build].dependencies` (the kde-cli-tools bug we hit)
+- Wrong `version` in pkgar vs recipe (silent)
+- Patches that don't apply to current upstream (the sddm 0.21 issue)
+
+**Proposal.** Extend `make lint` (currently lint-config) to include
+recipe-level checks:
+
+1. For every recipe, parse `recipe.toml` and verify `[build].dependencies`
+   lists every `[package].dependencies` member. (A mismatch between the
+   two lists is currently a common bug.)
+2. For every recipe with a `[source].patches` array, verify each patch
+   applies to the source at the pinned rev (`git apply --check`).
+3. For every recipe, verify the resulting `.pkgar` is in `repo/` with
+   matching `version =` in the toml.
+4. For every recipe with `[build].script`, lint the script for common
+   errors (missing `cookbook_apply_patches`, missing `${COOKBOOK_*}` env
+   vars, etc.).
+
+**Expected gain.** Catch issues at `make lint` time, not 2 hours into
+a cook. The kde-cli-tools missing-dep bug alone cost 30+ minutes.
+
+**Risk.** None. Lint is a separate step.
+
+### 6. `recipes/kf6-*` recipe dep audit (S, ~0.5 day)
+
+**Problem.** The 45 KF6 recipes have grown over time and their
+`[build].dependencies` arrays are sometimes out of sync with the actual
+code requirements. Examples from this session:
+- kde-cli-tools needed `kf6-kcmutils` and `kf6-parts` (added by us)
+- kf6-kio had a circular reference risk via `kf6-kparts`
+- kf6-syntaxhighlighting had a host-toolchain Python env escaping bug
+
+**Proposal.** Run a one-time `audit-recipe-deps.sh` that, for each KF6
+recipe, downloads the source, parses the CMakeLists.txt + *.cmake
+files, extracts `find_package(KF6 ... COMPONENTS ...)` calls, and
+verifies every component is in `[build].dependencies`. Report any
+mismatches as warnings.
+
+**Expected gain.** Prevents future "missing dep" failures. No runtime
+impact.
+
+**Risk.** None.
+
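The extraction step might be prototyped as below. This is a rough sketch under several assumptions: it only handles the `find_package(KF6 ... COMPONENTS ...)` form (not `find_package(KF6CoreAddons ...)`), it does not strip CMake comments, and the component-to-recipe mapping is supplied by the caller because the real recipe naming scheme is not derivable from the component name alone. The `"kf6-" + lowercase` fallback is a guess.

```python
import re

# Captures the argument list of find_package(KF6 ...) calls only
FIND_PACKAGE = re.compile(r"find_package\s*\(\s*KF6\b([^)]*)\)", re.IGNORECASE)
KEYWORDS = {"REQUIRED", "COMPONENTS", "OPTIONAL_COMPONENTS", "QUIET",
            "CONFIG", "NO_MODULE", "EXACT"}


def kf6_components(cmake_text):
    """Collect component names from find_package(KF6 ... COMPONENTS ...) calls."""
    found = set()
    for match in FIND_PACKAGE.finditer(cmake_text):
        for token in match.group(1).split():
            # Skip keywords, ${VARIABLE} references, and version numbers
            if token in KEYWORDS or token.startswith("${") or token[0].isdigit():
                continue
            found.add(token)
    return found


def missing_recipes(cmake_text, build_deps, recipe_for=None):
    """Map components to recipe names and report those not in build_deps."""
    recipe_for = recipe_for or {}
    wanted = {recipe_for.get(c, "kf6-" + c.lower())  # fallback name is a guess
              for c in kf6_components(cmake_text)}
    return sorted(wanted - set(build_deps))
```

Because the fallback mapping can produce names that do not exist as recipes, the output is best treated as a list of warnings to review, in line with the proposal.
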
+### 7. QML gate — make Qt6Quick host-targetable (L, ~2 weeks)
+
+**Problem.** Qt6Quick/QML cross-compilation is broken on Redox. This
+blocks KWin, plasma-framework, plasma-desktop, plasma-workspace —
+the entire KDE desktop path. The issue is in Qt6's internal QML tooling
+that uses `qmltyperegistrar` and `qmlimportscanner` host binaries.
+
+**Proposal.** Two-track approach:
+
+A. **Short term (S).** Build Linux-host x86_64 `qmltyperegistrar` and
+`qmlimportscanner` binaries, install them in
+`~/.redoxer/x86_64-unknown-redox/toolchain/bin/`, and add them to the
+toolchain. The KF6 recipes' cmake already supports `QT_HOST_PATH` for
+this purpose.
+
+B. **Long term (L).** Add a Redox-host qmltyperegistrar implementation.
+This requires re-implementing ~2000 lines of Qt internal C++, which is
+out of scope for "complex fixes" and needs its own sub-project.
+
+**Expected gain.** Track A unblocks the entire KDE desktop path. Track B
+is a long-term maintainability win.
+
+**Risk.** Track A is low risk (it's how upstream Redox already handles
+it). Track B is high risk (substantial new code).
+
+### 8. `redbear_qt_link_sysroot_dirs` should be a no-op when not needed (S, ~0.25 day)
+
+**Problem.** Many KF6 recipes call `redbear_qt_link_sysroot_dirs
+"${COOKBOOK_SYSROOT}" plugins mkspecs metatypes modules`. This is
+needed for qtbase's CMake configs to find the right paths. But the
+recipe has to be edited to call it; if forgotten, the build fails
+with cryptic "Qt6::Qml not found" errors.
+
+**Proposal.** Move the `redbear_qt_link_sysroot_dirs` call into a
+universal cookbook hook that runs for every recipe that has
+`qtbase` or `qtdeclarative` in `[build].dependencies`. The hook
+auto-detects qt deps and applies the symlinks.
+
+**Expected gain.** Removes a common footgun. New KF6 recipes just work.
+
+**Risk.** Low — purely additive.
+
+### 9. Cookbook build-failure classifier (M, ~1 day)
+
+**Problem.** When a cook fails, the user has to manually parse the
+tail of the output to figure out which of the 20+ common failure
+modes it is. We hit at least 8 distinct failure modes this session:
+- GLESv2 / Qt6Gui visibility
+- Python3 development headers missing
+- LibMount missing
+- relibc `<search.h>` not found
+- C++20 std::ranges not declared
+- C++ qfloat16 (__extendhfdf2) missing
+- Stale sysroot (KF6CoreAddons 6.10 vs 6.26)
+- gettext gnulib rebuild loop
+
+**Proposal.** Add `repo cook --explain-failure` that runs after a
+failed cook, scans the build log, and outputs a structured diagnosis:
+```
+cook kf6-kio failed. Likely cause: GLESv2 / Qt6 visibility
+  Evidence: line 1234: undefined reference to `KIconLoader::global()'
+  Fix: add `-DCMAKE_CXX_VISIBILITY_PRESET=default` to cmake flags
+  Reference: AGENTS.md §"COMPLEX FIX CHECKLIST (v6.0-impl17)" entry 10
+```
+
+**Expected gain.** Cut per-failure diagnosis from 5-10 minutes to
+10-30 seconds. Critical for new contributors.
+
+**Risk.** None — read-only analysis.
+
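At its core the classifier is a rule table scanned over the log. The sketch below shows that shape with three of the failure modes listed above. The regular expressions are illustrative guesses at what the compiler and linker messages look like and would need to be tuned against real logs. `Diagnosis` and `explain_failure` are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass
class Diagnosis:
    cause: str      # human-readable failure mode
    line_no: int    # 1-based line number of the evidence in the log
    evidence: str   # the matching log line, stripped


# (pattern, cause) pairs; patterns are assumptions, not verified against real logs
RULES = [
    (re.compile(r"search\.h.*No such file"), "relibc <search.h> not found"),
    (re.compile(r"__extendhfdf2"), "C++ qfloat16 (__extendhfdf2) missing"),
    (re.compile(r"std::ranges.*not (been )?declared"), "C++20 std::ranges not declared"),
]


def explain_failure(log_text):
    """Return a Diagnosis for the first log line matching a known rule, else None."""
    for line_no, line in enumerate(log_text.splitlines(), start=1):
        for pattern, cause in RULES:
            if pattern.search(line):
                return Diagnosis(cause, line_no, line.strip())
    return None
```

Returning the first match keeps the sketch simple. A fuller version could collect every match and attach a suggested fix and an AGENTS.md reference to each rule, as in the sample output above.
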
+### 10. Cookbook scratch-build system (L, ~1 week)
+
+**Problem.** When something goes deeply wrong (e.g. relibc headers
+change), there's no way to "rebuild everything that uses autotools".
+`build-redbear.sh` has stale detection, but it only triggers on
+relibc/kernel/base source commits, not on dep pkgar changes.
+
+**Proposal.** Add `make scratch-rebuild` that:
+1. Identifies all packages using autotools (pcre2, gettext, libiconv, etc.)
+2. For each, deletes `target/<arch>/build` and `target/<arch>/sysroot`
+3. Recooks in dependency order
+
+The target uses the existing content-hash fingerprints to scope the
+rebuild narrowly. It is most useful after a toolchain or relibc change.
+
+**Expected gain.** Predictable, narrow rebuild after low-level changes.
+Eliminates the "delete and pray" pattern.
+
+**Risk.** Medium — needs to be tested against real cascades.
+
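Step 3 amounts to a topological sort restricted to the affected packages. A minimal sketch using the standard-library `graphlib` module (Python 3.9+) follows. The `deps` mapping and the `recook_order` name are assumptions for illustration, and how the affected set is computed from fingerprints is not shown.

```python
from graphlib import TopologicalSorter


def recook_order(deps, affected):
    """Order the affected packages so each is cooked after its affected deps.

    `deps` maps a package name to the set of packages it depends on.
    Dependencies outside `affected` are ignored, since they are not recooked.
    """
    affected = set(affected)
    graph = {pkg: set(deps.get(pkg, ())) & affected for pkg in affected}
    # static_order() yields dependencies before their dependents and raises
    # graphlib.CycleError if the affected set contains a dependency cycle
    return list(TopologicalSorter(graph).static_order())
```

The relative order of packages that do not depend on each other is unspecified, which is fine for a serial recook and leaves room for the parallel pool from proposal 1.
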
+## Summary
+
+| # | Title | Size | Gain | Risk |
+|---|---|---|---|---|
+| 1 | Parallel-safe cook pool | M | 2-3x throughput | Medium |
+| 2 | `cook --repair` mode | S | 5-40x faster per-failure recovery | Low |
+| 3 | Per-recipe patch idempotency auditor | S | Catch at lint | None |
+| 4 | Cook TUI status | M | UX | None |
+| 5 | Build-time recipe lint | M | Catch at lint | None |
+| 6 | KF6 recipe dep audit | S | Prevent bugs | None |
+| 7 | QML gate | L | Unblock KDE | A: Low, B: High |
+| 8 | Auto-link Qt sysroot dirs | S | Fewer bugs | Low |
+| 9 | Failure classifier | M | 10-60x faster diagnosis | None |
+| 10 | Scratch-rebuild system | L | Predictable | Medium |
+
+Recommended order: #3, #6, #8 (S-sized, low risk, quick wins), then #2,
+#5, #9 (S- to M-sized, real productivity wins), then #4, #7A, #10, #1
+(bigger), then #7B as a separate project.