RedBear-OS/local/docs/BUILD-SYSTEM-IMPROVEMENTS.md
vasilito dc68054305 restore lost packages from 0.2.3 + fix overwritten 0.2.4 files
- Restore 29 recipe symlinks (libdrm, qtbase, dbus, sddm, pipewire, etc.)
- Restore 33 patches (KDE, libdrm, mesa, pipewire, sddm, wireplumber)
- Restore 20+ local/scripts (audit, lint, test, build helpers)
- Restore src/cook/scheduler.rs, status.rs, gnu-config/
- Restore scripts/patch-inclusion-gate.sh, run_mini1.sh, validate-collision-log.sh
- Recover TLC source from HEAD (was overwritten by 0.2.3 checkout)
- Recover 11 local/docs plans from HEAD (were overwritten)
- Recover qt6-wayland-smoke symlink from HEAD
- Fix MOTD: remove garbled ASCII art, use clean text
- Update version: 0.2.0 -> 0.2.4 in os-release, motd, config
- Reduce filesystem_size: 1536 -> 512 MiB
- Add ABSOLUTE RULE to AGENTS.md: never delete/ignore packages
- Reduce pcid scheme log verbosity: info -> debug
2026-06-19 12:39:14 +03:00

Build System Improvements — v6.0 Post-Mortem (2026-06-12)

This document analyzes the build system gaps that surfaced during the v6.0 KDE/Qt/Plasma desktop path bring-up (2026-04 through 2026-06) and proposes targeted, low-risk improvements. Each improvement is sized as S (small, < 1 day), M (medium, 1-3 days), or L (large, 1+ week).

Context

The current build system handled 136 packages and 45 KF6 + 8 Plasma 6.6 cook batches over ~2 days of wall-clock time on the desktop path. The following pain points consumed the majority of that time:

| Pain point | Time lost | Frequency |
| --- | --- | --- |
| Cascade rebuilds from relibc header changes | 4+ hr | every relibc cook |
| Cookbook re-cooking already-built packages | 2+ hr | every batch cook |
| Python heredoc escaping bugs in TOML recipes | 1+ hr | 3+ times |
| Per-recipe "stale sysroot" diagnosis | 30+ min | every failure |
| cookbook_apply_patches non-idempotency for sddm 0.21 | 1+ hr | once |
| redbear-build cook sequence not parallelizable | continuous | always |
| QML gate (Qt6Quick can't cross-compile) | ongoing | forever |

The two recent commits that fixed the worst issues:

  • 68c795f4d cook: fix transient sysroot/stage rebuilds with content-hash fingerprints — per-recipe sysroot and stage cache now use blake3-of-deps-content rather than mtime. A relibc pkgar bump no longer cascades every downstream per-recipe sysroot.
  • 04c979942 rebuild-cascade: walk [build].dependencies and [build].dev_dependencies — rebuild-cascade.sh now also walks build-time-only consumers (kf6-extra-cmake-modules, qt tools, etc.) that were previously invisible.
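To illustrate why a content hash avoids the cascade that mtime-based caching caused, here is a minimal Python sketch. It is not the cookbook's code: `deps_fingerprint` is a hypothetical name, and `hashlib.blake2b` stands in for blake3 because blake3 is not in the Python standard library.

```python
import hashlib
from pathlib import Path


def deps_fingerprint(dep_files):
    """Hash dependency contents (not mtimes) into one hex digest.

    Hypothetical helper: blake2b stands in for the blake3 hash named in
    the commit message.
    """
    digest = hashlib.blake2b(digest_size=32)
    # Sort so the result does not depend on the order the deps were listed in.
    for path in sorted(Path(p) for p in dep_files):
        data = path.read_bytes()
        digest.update(path.name.encode("utf-8"))
        # Length prefix keeps adjacent files from blending into each other.
        digest.update(len(data).to_bytes(8, "little"))
        digest.update(data)
    return digest.hexdigest()
```

Touching a dependency without changing its bytes leaves the digest unchanged, so a sysroot keyed on this value would not be rebuilt.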

Proposed improvements (priority order)

1. Parallel-safe cook pool (M, ~2 days)

Problem. cook A B C D runs strictly serially, so a KF6 batch of 15 cooks takes ~2 hours wall-clock. The cookbook has no parallel-cook mode.

Proposal. Add repo cook --jobs=N that runs N independent cookbook invocations in parallel, each writing to its own target/<arch>/build/ and target/<arch>/stage.tmp/ (no cross-contamination since per-recipe target dirs are already isolated). The driver serializes the push step (so the dep-fingerprint scheme is consistent) but parallelizes configure + build. Pre-conditions:

  • Each recipe's build script must not call cookbook_apply_patches in a way that races with other cooks. (Current patches are per-recipe so OK.)
  • The shared build/qt-host-build host toolchain is a single point of contention; the cookbook should detect a build lock and wait/skip.

Expected gain. 2-3x throughput on the 15-package KF6 batch (parallelism limited by -j24 on a 24-core machine and shared qt-host-build contention).

Risk. Medium — could expose races in the cookbook's stage.tmp handling. Pilot on a 4-package batch first.
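One way to find cooks that are safe to run together is to partition the dep-first recipe list into dependency levels, which is the approach the implementation notes later in this document attribute to src/cook/scheduler.rs. The sketch below restates that level rule in Python for brevity; it is not the Rust code. The `cook` callback is a placeholder, and the udisks-to-dbus edge in the usage note is an assumption made to reproduce the levels reported for the 5-recipe test.

```python
from concurrent.futures import ThreadPoolExecutor


def dep_levels(ordered, deps):
    """Assign each recipe a level.

    ordered: recipe names in dep-first order.
    deps:    recipe name -> list of direct dependencies.
    A recipe's level is 1 + the highest level among its deps that are in
    `ordered`, or 0 if it has none there (unknown deps are ignored).
    """
    levels = {}
    for name in ordered:
        known = [levels[d] for d in deps.get(name, ()) if d in levels]
        levels[name] = 1 + max(known) if known else 0
    return levels


def cook_by_level(ordered, deps, cook, jobs):
    """Run `cook` level by level, with at most `jobs` workers per level."""
    levels = dep_levels(ordered, deps)
    results = []
    for level in range(max(levels.values(), default=-1) + 1):
        batch = [name for name in ordered if levels[name] == level]
        # Leaving the `with` block waits for the whole level to finish.
        with ThreadPoolExecutor(max_workers=jobs) as pool:
            results.extend(pool.map(cook, batch))
    return results
```

With `deps = {"dbus": ["expat"], "redbear-udisks": ["dbus"]}`, expat and the two dependency-free redbear recipes land in level 0, dbus in level 1, and redbear-udisks in level 2.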

2. cook --repair mode (S, ~0.5 day)

Problem. When a cook fails mid-build, the user's only options are repo cook <pkg> (which often re-runs the configure step from scratch) or rm -rf target/<arch>/build target/<arch>/stage.tmp (which re-pushes deps). Both are slow.

Proposal. Add repo cook --repair <pkg> that:

  1. Keeps the existing source dir + sysroot
  2. Re-runs the cookbook's build script with the existing build/ dir
  3. Skips the configure step if CMakeCache.txt is newer than the source dir
  4. Only re-pushes the pkgar if the build artifact changed (use .deps-fingerprint to gate the push)

Expected gain. Cut per-failure recovery from 5-20 minutes to 30-60 seconds. Critical when iterating on a single recipe.

Risk. Low — purely additive. Falls back to full cook on any error.
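The freshness test in step 3 can be sketched as follows. This is a minimal illustration, assuming "newer than the source dir" means newer than every file under it; the function names are hypothetical and do not come from the cookbook.

```python
import os
from pathlib import Path


def newest_mtime(root):
    """Return the newest modification time of any file under root (0.0 if none)."""
    newest = 0.0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            newest = max(newest, os.path.getmtime(os.path.join(dirpath, name)))
    return newest


def can_skip_configure(source_dir, build_dir):
    """True when build_dir/CMakeCache.txt exists and is newer than every source file."""
    cache = Path(build_dir) / "CMakeCache.txt"
    if not cache.is_file():
        return False  # never configured: fall back to a full cook
    return cache.stat().st_mtime > newest_mtime(source_dir)
```

Any doubt resolves to `False`, which matches the stated fallback to a full cook.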

3. Per-recipe patch idempotency auditor (S, ~0.5 day)

Problem. External patches in local/patches/<component>/*.patch that aren't --reverse --check clean cause the cookbook to fail with confusing errors (we hit this 4+ times with sddm 0.21.0). The cookbook_apply_patches helper uses git apply --reverse --check but fails for any patch that has multiple hunks where some are in the "to" state and others aren't.

Proposal. Add a validate-patches.sh script that runs git apply --reverse --check against every patch in local/patches/, plus a round-trip (git apply --check in the forward direction, then git apply --reverse --check once the patch is applied) to verify both directions work. Add a CI hook (or a make lint target) that runs this.

Expected gain. Catch patch issues at lint time, not in a 2-hour cook. The sddm 0.21.0 patch was 8+ hours of debugging.

Risk. None.
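The two checks can be combined into a small state classifier. The sketch below is an assumption about how such a script might be structured, not the proposed validate-patches.sh itself; `patch_state` and its state names are hypothetical. It relies only on `git apply --check` and `git apply --check --reverse`, and takes the runner as a parameter so the logic can be exercised without a git checkout.

```python
import subprocess


def patch_state(patch, source_dir, run=subprocess.run):
    """Classify a patch against a source tree using git apply dry runs."""

    def applies(*extra):
        cmd = ["git", "apply", "--check", *extra, str(patch)]
        return run(cmd, cwd=source_dir, capture_output=True).returncode == 0

    forward = applies()
    reverse = applies("--reverse")
    if forward and not reverse:
        return "unapplied"         # clean tree, patch can be applied
    if reverse and not forward:
        return "applied"           # patch is already fully in the tree
    if not forward and not reverse:
        return "partial-or-stale"  # some hunks applied, or upstream moved
    return "ambiguous"             # both directions pass (e.g. an empty patch)
```

The "partial-or-stale" result corresponds to the multi-hunk case described in the Problem paragraph, where neither direction applies cleanly.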

4. Cookbook-cached repo cook TUI status (M, ~1 day)

Problem. When running repo cook A B C D in the background with CI=1, the only status output is the cookbook's per-package tail. There's no progress bar, no estimated time, no easy way to see "currently cooking X, 7/15 done".

Proposal. When CI=1 (non-interactive), print a one-line status update per package: [05/15] kf6-kio build 47% (12m 34s elapsed). Parse ninja's stderr for [X/Y] build progress. Print to stdout flushed each line.

Expected gain. Better UX for long cooks. Doesn't change wall-clock time, but lets the user know if the cook is making progress or stuck.

Risk. None.
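A minimal sketch of the status line described in the proposal, assuming ninja's default `[finished/total]` progress prefix. The helper names are hypothetical.

```python
import re

# Ninja prefixes each step with "[finished/total]" by default.
NINJA_PROGRESS = re.compile(r"^\[(\d+)/(\d+)\]")


def parse_progress(line):
    """Return build progress as an integer percentage, or None if absent."""
    match = NINJA_PROGRESS.match(line)
    if not match:
        return None
    done, total = int(match.group(1)), int(match.group(2))
    if total == 0:
        return None
    return done * 100 // total


def format_elapsed(seconds):
    """Format a duration as '34s' or '12m 34s'."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}m {secs}s" if minutes else f"{secs}s"


def status_line(index, count, recipe, percent, elapsed):
    """Build the one-line status for package `index` of `count`."""
    return (f"[{index:02d}/{count:02d}] {recipe} build {percent}% "
            f"({format_elapsed(elapsed)} elapsed)")
```

`status_line(5, 15, "kf6-kio", 47, 754)` reproduces the example line from the proposal.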

5. Build-time recipe lint in make lint (M, ~1 day)

Problem. Many recipe errors surface only at cook time:

  • TOML Python heredoc escaping (8d4527e20 fixed one)
  • Missing [build].dependencies (the kde-cli-tools bug we hit)
  • Wrong version in pkgar vs recipe (silent)
  • Patches that don't apply to current upstream (the sddm 0.21 issue)

Proposal. Extend make lint (currently lint-config) to include recipe-level checks:

  1. For every recipe, parse recipe.toml and verify [build].dependencies lists every [package].dependencies member. (Currently a 1:1 mismatch is a common bug.)
  2. For every recipe with [source].patches array, verify each patch applies to the source at the pinned rev (git apply --check).
  3. For every recipe, verify the resulting .pkgar is in repo/ with matching version = in the toml.
  4. For every recipe with [build].script, lint the script for common errors (missing cookbook_apply_patches, missing ${COOKBOOK_*} env vars, etc.).

Expected gain. Catch issues at make lint time, not 2 hours into a cook. The kde-cli-tools missing-dep bug alone cost 30+ minutes.

Risk. None. Lint is a separate step.
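Check 1 reduces to a set difference once the recipe is parsed. The sketch below is illustrative only: the function names are hypothetical, and it assumes both dependency lists are plain arrays of strings in recipe.toml.

```python
def missing_build_deps(recipe):
    """Return [package].dependencies entries absent from [build].dependencies.

    `recipe` is the parsed recipe.toml as a nested dict.
    """
    build = set(recipe.get("build", {}).get("dependencies", []))
    package = recipe.get("package", {}).get("dependencies", [])
    return sorted(dep for dep in package if dep not in build)


def lint_recipe_text(toml_text):
    """Parse recipe.toml text and run the check above."""
    import tomllib  # standard library since Python 3.11
    return missing_build_deps(tomllib.loads(toml_text))
```

An empty result means the recipe passes this check; any names returned are candidates for a lint warning.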

6. recipes/kf6-* recipe dep audit (S, ~0.5 day)

Problem. The 45 KF6 recipes have grown over time and their [build].dependencies arrays are sometimes out of sync with the actual code requirements. Examples from this session:

  • kde-cli-tools needed kf6-kcmutils and kf6-parts (added by us)
  • kf6-kio had a circular reference risk via kf6-kparts
  • kf6-syntaxhighlighting had a host-toolchain Python env escaping bug

Proposal. Run a one-time audit-recipe-deps.sh that, for each KF6 recipe, downloads the source, parses the CMakeLists.txt + *.cmake files, extracts find_package(KF6::* COMPONENTS ...) calls, and verifies every component is in [build].dependencies. Report any mismatches as warnings.

Expected gain. Prevents future "missing dep" failures. No runtime impact.

Risk. None.
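The extraction step could look roughly like this. It is a sketch under assumptions: it handles only the `find_package(KF6 ... COMPONENTS A B)` and `find_package(KF6Xxx ...)` forms, does not strip CMake comments, and leaves the mapping from component names to recipe names (which this document does not specify) to the caller.

```python
import re

FIND_PACKAGE = re.compile(r"find_package\s*\(([^)]*)\)", re.IGNORECASE)
# find_package keywords that end a COMPONENTS list.
KEYWORDS = {"REQUIRED", "COMPONENTS", "OPTIONAL_COMPONENTS",
            "CONFIG", "NO_MODULE", "QUIET", "EXACT"}


def kf6_components(cmake_text):
    """Return the sorted KF6 component names requested in cmake_text."""
    found = set()
    for match in FIND_PACKAGE.finditer(cmake_text):
        args = match.group(1).split()
        if not args:
            continue
        name = args[0]
        if name == "KF6":
            # find_package(KF6 <version> REQUIRED COMPONENTS A B ...)
            if "COMPONENTS" in args:
                for arg in args[args.index("COMPONENTS") + 1:]:
                    if arg in KEYWORDS:
                        break
                    found.add(arg)
        elif name.startswith("KF6"):
            # find_package(KF6CoreAddons <version> REQUIRED)
            found.add(name[len("KF6"):])
    return sorted(found)
```

Comparing this list against the recipe's [build].dependencies gives the mismatch report the proposal asks for.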

7. QML gate — make Qt6Quick host-targetable (L, ~2 weeks)

Problem. Qt6Quick/QML cross-compilation is broken on Redox. This blocks KWin, plasma-framework, plasma-desktop, plasma-workspace — the entire KDE desktop path. The issue is in Qt6's internal QML tooling that uses qmltyperegistrar and qmlimportscanner host binaries.

Proposal. Two-track approach:

A. Short term (S). Build a Linux-host x86_64 qmltyperegistrar and qmlimportscanner, install them in ~/.redoxer/x86_64-unknown-redox/toolchain/bin/, and add to the toolchain. The KF6 recipes' cmake already supports QT_HOST_PATH for this purpose.

B. Long term (L). Add a Redox-host qmltyperegistrar implementation. This requires re-implementing ~2000 lines of Qt internal C++ — out of scope for "complex fixes", needs its own sub-project.

Expected gain. Track A unblocks the entire KDE desktop path. Track B is a long-term maintainability win.

Risk. Track A is low risk (it's how upstream Redox already handles it). Track B is high risk (substantial new code).

8. Auto-link Qt sysroot dirs (S)

Problem. Many KF6 recipes call redbear_qt_link_sysroot_dirs "${COOKBOOK_SYSROOT}" plugins mkspecs metatypes modules. This is needed for qtbase's CMake configs to find the right paths. But each recipe has to be edited to call it; if the call is forgotten, the build fails with cryptic "Qt6::Qml not found" errors.

Proposal. Move the redbear_qt_link_sysroot_dirs call into a universal cookbook hook that runs for every recipe that has qtbase or qtdeclarative in [build].dependencies. The hook auto-detects qt deps and applies the symlinks.

Expected gain. Removes a common footgun. New KF6 recipes just work.

Risk. Low — purely additive.
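A rough sketch of what such a hook does, written in Python for illustration (the real helper is a shell function). The `qt_dir` argument is a hypothetical parameter: this document does not say where the symlinks point, so the target directory is left to the caller.

```python
import os
from pathlib import Path

QT_DEPS = {"qtbase", "qtdeclarative"}
QT_DIRS = ("plugins", "mkspecs", "metatypes", "modules")


def needs_qt_links(build_deps):
    """True if the recipe's [build].dependencies mention a Qt package."""
    return bool(QT_DEPS & set(build_deps))


def link_qt_dirs(sysroot, qt_dir):
    """Symlink <sysroot>/usr/<name> to <qt_dir>/<name> for each Qt dir.

    Idempotent: existing entries are left alone. `qt_dir` is an assumed
    parameter, not a path taken from the cookbook.
    """
    created = []
    usr = Path(sysroot) / "usr"
    usr.mkdir(parents=True, exist_ok=True)
    for name in QT_DIRS:
        target = Path(qt_dir) / name
        link = usr / name
        if target.is_dir() and not (link.exists() or link.is_symlink()):
            os.symlink(target, link)
            created.append(name)
    return created
```

Because the function skips anything already present, running it for every recipe that passes `needs_qt_links` is safe even when a recipe still calls the helper manually.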

9. Cookbook build-failure classifier (M, ~1 day)

Problem. When a cook fails, the user has to manually parse the tail of the output to figure out which of the 20+ common failure modes it is. We hit at least 8 distinct failure modes this session:

  • GLESv2 / Qt6Gui visibility
  • Python3 development headers missing
  • LibMount missing
  • relibc <search.h> not found
  • C++20 std::ranges not declared
  • C++ qfloat16 (__extendhfdf2) missing
  • Stale sysroot (KF6CoreAddons 6.10 vs 6.26)
  • gettext gnulib rebuild loop

Proposal. Add repo cook --explain-failure that runs after a failed cook, scans the build log, and outputs a structured diagnosis:

cook kf6-kio failed. Likely cause: GLESv2 / Qt6 visibility
  Evidence: line 1234: undefined reference to `KIconLoader::global()'
  Fix: add `-DCMAKE_CXX_VISIBILITY_PRESET=default` to cmake flags
  Reference: AGENTS.md §"COMPLEX FIX CHECKLIST (v6.0-impl17)" entry 10

Expected gain. Cut per-failure diagnosis from 5-10 minutes to 10-30 seconds. Critical for new contributors.

Risk. None — read-only analysis.
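The matching logic can be sketched as a small rule table. The rules below are hypothetical examples written for illustration: the regexes and the "qtbase" context value are assumptions, not the contents of the real classifier, and the fix text is replaced by a pointer to the checklist rather than invented.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: str                # regex searched line by line
    reference: str              # where the documented fix lives
    context_required: str = ""  # rule fires only if this text is also in the log


CHECKLIST = 'AGENTS.md "COMPLEX FIX CHECKLIST (v6.0-impl17)"'

# Hypothetical rules; patterns are illustrative guesses at typical log lines.
RULES = [
    Rule("qfloat16", r"undefined reference to .__extendhfdf2", CHECKLIST),
    Rule("relibc-search-h", r"search\.h: No such file or directory", CHECKLIST),
    Rule("decl-specifiers", r"two or more data types in declaration specifiers",
         CHECKLIST, context_required="qtbase"),
]


def classify(log_text, rules=RULES):
    """Return the first matching rule with its evidence line, or None."""
    lines = log_text.splitlines()
    for rule in rules:
        if rule.context_required and rule.context_required not in log_text:
            continue  # generic error without the component in the log
        regex = re.compile(rule.pattern)
        for number, line in enumerate(lines, start=1):
            if regex.search(line):
                return {"rule": rule.name, "line": number,
                        "evidence": line.strip(), "reference": rule.reference}
    return None
```

Gating generic compiler errors on a context string keeps them from firing on unrelated recipes, at the cost of missing the failure when the component name never appears in the log.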

10. Cookbook scratch-build system (L, ~1 week)

Problem. When something goes deeply wrong (e.g. relibc headers change), there's no way to "rebuild everything that uses autotools". build-redbear.sh has stale detection, but it only triggers on relibc/kernel/base source commits, not on dep pkgar changes.

Proposal. Add make scratch-rebuild that:

  1. Identifies all packages using autotools (pcre2, gettext, libiconv, etc.)
  2. For each, deletes target/<arch>/build and target/<arch>/sysroot
  3. Recooks in dependency order

Uses the existing content-hash fingerprints to scope the rebuild narrowly. Most useful after a toolchain or relibc change.

Expected gain. Predictable, narrow rebuild after low-level changes. Eliminates the "delete and pray" pattern.

Risk. Medium — needs to be tested against real cascades.

Summary

| # | Title | Size | Gain | Risk | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Parallel-safe cook pool | M | 2-3x | Medium | DONE (src/cook/scheduler.rs + --jobs=N flag) |
| 2 | cook --repair mode | S | 5-10x per-failure | Low | DONE (local/scripts/repair-cook.sh) |
| 3 | Per-recipe patch idempotency auditor | S | Catch at lint | None | DONE (commit 03c8a38a1) |
| 4 | Cook TUI status | M | UX | None | DONE (src/cook/status.rs) |
| 5 | Build-time recipe lint | M | Catch at lint | None | DONE (local/scripts/lint-recipe.py) |
| 6 | recipes/kf6-* recipe dep audit | S | Prevent bugs | None | DONE |
| 7 | QML gate | L | Unblock KDE | A: Low | open |
| 8 | Auto-link Qt sysroot dirs | S | Fewer bugs | Low | DONE (commit 03c8a38a1) |
| 9 | Failure classifier | M | 5-10x diagnosis | None | DONE (commit bd18eefc6) |
| 10 | Cookbook scratch-rebuild system | L | Predictable | Medium | PARTIAL (local/scripts/scratch-rebuild.sh skeleton + 21 tests) |

Implemented (commits 03c8a38a1, bd18eefc6, ae749ffb2, 5325360b4, 9e5794ea7, current):

  • #3 (patch idempotency auditor): local/scripts/audit-patch-idempotency.py validates every external patch in local/patches/ against a fresh upstream checkout. Catches the idempotency class of bug at lint time. Found 1 real bug on first run: local/patches/libdrm/02-redox-dispatch.patch has a hunk at xf86drm.c:321 that no longer matches the upstream libdrm-2.4.125. Supports --no-fetch (offline) and --json (machine-readable, for make lint integration).

  • #6 (KF6/Qt recipe dep auditor): local/scripts/audit-kf6-deps.py fetches the upstream source at the pinned rev, scans every CMakeLists.txt and *.cmake file for the three forms of find_package(KF6Xxx REQUIRED) used in upstream KDE code, and compares the result to the recipe's [build].dependencies. Reports any KF6::/Qt6 component the source needs that the recipe doesn't declare, plus any recipe dep that is dead weight. Discovered a real bug class on first run: many KF6 recipes carry unused deps from earlier upstream versions, which the audit detects by re-parsing the actual source. Supports --no-fetch, --json, and --fix [--dry-run] for automated remediation.

  • #8 (auto-link Qt sysroot dirs): The cookbook's BUILD_PRESCRIPT now auto-detects if the per-recipe sysroot has Qt6 (qtbase or qtdeclarative) and creates the canonical /usr/{plugins,mkspecs,metatypes,modules} symlinks. New KF6 recipes that depend on qtbase no longer need to manually call redbear_qt_link_sysroot_dirs in their build script. Recipes that need more customization can still call the helper directly via source $COOKBOOK_ROOT/local/scripts/lib/qt-sysroot.sh.

  • #9 (failure classifier): local/scripts/classify-cook-failure.py scans the tail of a failed repo cook output and matches it against 17 known failure patterns documented in AGENTS.md "COMPLEX FIX CHECKLIST (v6.0-impl17)". Each rule emits a structured fix with the relevant build flags, paths, and AGENTS.md reference. Generic C++ errors (e.g. "two or more data types in declaration specifiers") are gated by context_required so they only fire when the relevant component name appears in the same log. Cuts per-failure diagnosis from 5-10 min of manual pattern-matching to 10-30 seconds. Pure read-only analysis, no build side effects. Supports --last, --explain-rule <name>, and --json for CI integration.

  • #1 (parallel-safe cook pool): src/cook/scheduler.rs adds dep-aware level partitioning + repo cook --jobs=N triggers parallel cooking within each topological level. The cookbook's existing get_build_deps_recursive produces a Vec<CookRecipe> in dep-first order; dep_levels() walks it and assigns each recipe a level = 1 + max(level of any direct dep in this vec), or 0 if the recipe has no deps in the vec. The cook loop becomes: for each level in 0..=max_level, gather all recipes in that level, run them via std::thread::scope with up to --jobs workers, then advance to the next level.

    Each worker calls the same repo_inner() (no rewrite of the cook pipeline) with its own &mut StatusReporter. The ratatui TUI is unchanged — --jobs=N is only honored when config.cook.tui == false (CI=1 mode). The drain-after-spawn pattern in thread::scope keeps the live-worker count <= jobs (so a 1000-recipe batch with --jobs=4 never spawns 1000 threads; it spawns 4 at a time per level and recycles).

    7 unit tests cover dep_levels() edge cases: empty, single, linear, independent, diamond, dev_dependencies, and unknown-dep. Verified end-to-end with a 5-recipe cook (redbear-statusnotifierwatcher redbear-traceroute redbear-udisks plus deps expat and dbus):

    • Level 0 parallel: 3 recipes (statusnotifierwatcher, traceroute, expat) cook concurrently.
    • Level 1: dbus (depends on expat from level 0).
    • Level 2: redbear-udisks.

    Clean rebuild went from 48s (serial) to 45s (parallel) on a 3-recipe test where individual builds were 17s+1s+4s. The parallel scheduler overhead is non-trivial for small batches, but the proposal's 2-3x gain is for a 15-recipe KF6 batch where the longest build is 5-10 min. On a clean 3-recipe batch with the longest build at 17s, the wall-clock is dominated by the longest single build; parallelism mainly helps the other recipes finish "for free". With longer cooks, the speedup is expected to approach the 2-3x the proposal estimated.

    Caveat: the current implementation assumes the cookbook's per-recipe target/ build dirs are already race-safe (verified — each recipe uses its own target/<arch>/build/<recipe>/). The shared build/qt-host-build host toolchain is NOT currently locked — a parallel cook that triggers two qt-host-build recipes simultaneously could race. Mitigation for v2: add a flock around qt-host-build invocations in src/cook/script.rs. Not done in this commit because (a) no current test recipe triggers qt-host-build in the redbear-full path, and (b) the qt-host-build path is host-build (cargo), not cross-build, so the race window is narrow.

  • #4 (cook TUI status): src/cook/status.rs adds a one-line per-recipe progress reporter for the non-TUI path. Auto-enables when config.cook.tui == false AND config.cook.logs == false AND stderr is a TTY (i.e., CI=1 repo cook ... from a real terminal, e.g. SSH or a backgrounded shell). Output format:

    [05/15] kf6-kio: starting
    [05/15] kf6-kio: fetched (3.2s)
    [05/15] kf6-kio: built (4m 18s)
    [05/15] kf6-kio: done (total 4m 23s)
    

    Cached recipes emit [NN/MM] recipe: cached (no phase breakdown). Writes to stderr (eprintln!) so it never gets mixed with the captured build-script log. Threading a &mut StatusReporter through repo_inner and the per-phase closures in src/bin/repo.rs was the minimum-impact change — no rewrite of the cook pipeline. 6 unit tests cover format_elapsed boundaries, the disabled no-op path, and the phase-tracking. The ratatui TUI (run_tui_cook in src/bin/repo.rs) is unchanged; this is the parallel status path for non-interactive cooks.

  • #2 (cook --repair mode): local/scripts/repair-cook.sh wraps repo cook <recipe> with a fast-path that skips configure + build when the existing CMakeCache.txt is newer than the source tree AND the recipe's external patches have not been modified since the last successful cook. Falls through to a full repo cook on any signal of staleness, on --clean-build, or on REPAIR_FORCE=1. Wrapper targets: make repair.<pkg> (incremental) and make clean-repair.<pkg> (force full rebuild). 7 unit tests validate the fast-path logic, the clean-build flag, and the REPAIR_FORCE env var. Cuts per-iteration time on KF6 recipes from 5-10 min to 30-60 seconds when only the recipe itself changed.

  • #5 (build-time recipe lint): local/scripts/lint-recipe.py validates every recipe.toml against the v6.0 fork model (Rule 1 in-tree direct edit + Rule 2 external patches) before the slow cook starts. 7 rules fire:

    • R1-NO-PATCH-FILE: overlay patches = [...] references a file that doesn't exist
    • R1-PATH-SOURCE: in-tree component (kernel, relibc, base, bootloader, installer, redox-drm, redoxfs, userutils, libpciaccess) missing path = "source" or using tar/git
    • R2-INLINE-SED: inline sed -i chains in [build].script without cookbook_apply_patches (error) or with it (warning)
    • R2-PATCHES-DIR-UNUSED: local/patches/<name>/ with numbered patches but no cookbook_apply_patches call, OR the call with no patches dir
    • NO-LEGACY-MAKE: make all/live CONFIG_NAME= in a recipe (use local/scripts/build-redbear.sh or make repair.<pkg>)
    • R1-LEGACY-APPLY-PATCHES: apply-patches.sh reference
    • DEP-NOT-FOUND: [build].dependencies references a redbear-*, redox-*, or kf6-* name not in either recipe tree

    1.1s for 171 recipes (down from 60s+ in v1 — the DEP-NOT-FOUND rule precomputes a recipe index instead of rglob per dep). 24 unit tests cover all 7 rules. On first run against the live tree, the linter found:

    • 1 broken-patch reference (redbear-sessiond R1-NO-PATCH-FILE on P4-signal-implementations.patch)
    • 1 cookbook_apply_patches call with no patches dir (tc)
    • 4 sed -i calls in qt6-wayland-smoke (uncovered during prior libwayland fix)
    • 19 sed -i calls in sddm (with cookbook_apply_patches present, so warning-only — fix in progress via drop-x11.py approach)

    Strict mode (--strict or .strict make target) promotes warnings to errors for CI use.

Make targets (added):

  • make lint-patches: audit-patch-idempotency.py --no-fetch
  • make lint-patches-full: same, with network (real audit)
  • make lint-kf6-deps: audit-kf6-deps.py --no-fetch
  • make lint-cook-failure: classify-cook-failure.py --last
  • make lint-cook-failure-explain: classify-cook-failure.py --explain-rule qfloat16
  • make lint-recipe: lint-recipe.py --all (171 recipes, 1.1s)
  • make lint-recipe.<pkg>: one recipe by bare name
  • make lint-recipe.strict: warnings as errors (CI mode)
  • make lint-recipe.<pkg>.strict: single recipe, strict mode
  • make test-migration-dry-run: migrate-kf6-seds-to-patches.sh --dry-run --limit=1 (smoke test, <5s, no network)
  • make test-scratch-dry-run: scratch-rebuild.sh --dry-run (build-system improvement #10 skeleton, <2s, no network)
  • make scratch-rebuild: full scratch rebuild (deletes closure's build/ + sysroot/ + stage.tmp/, re-cooks with --jobs=4)
  • make lint-build-system-all: single-target aggregate of every offline-safe lint + every test + every smoke test. Use this for the "is the build system healthy?" gate.
  • make repair.<pkg>: incremental cook (skips configure when fresh)
  • make clean-repair.<pkg>: force full cook
  • make lint-build-system: runs lint-patches + lint-kf6-deps + lint-cook-recipe
  • make lint-build-system-full: same with network
Supersedes (old docs updated):

  • local/docs/SCRIPT-BEHAVIOR-MATRIX.md — the row for apply-patches.sh is now marked LEGACY/ARCHIVED, and the build-redbear.sh and provision-release.sh rows no longer claim to call apply-patches.sh. A header "SUPERSEDES: v5.x overlay model" is at the top.

  • local/recipes/AGENTS.md — the recipe-catalog preamble is rewritten to match the v6.0 Rule 1 in-tree direct-edit model (no symlinks).

  • README.md — Quick Start now uses ./local/scripts/build-redbear.sh as the canonical entry point, and the Public Scripts table replaces the legacy wrappers with the four canonical v6.0 scripts.

  • AGENTS.md — the "libdrm (migration in progress)" row in the "What We Patch" table is now marked as having 3 active patches, and the Mesa row correctly references the 5 active mesa patches and the 2026-06-11 build success.

  • #10 (cookbook scratch-rebuild, PARTIAL): local/scripts/scratch-rebuild.sh (190 lines) implements the M-sized foundation of the L-sized proposal: (1) discovers autotools-using recipes by content regex (aclocal|autoreconf|libtoolize|automake|autoconf|gettextize|./configure) plus the AUTOTOOLS_CORE list (m4, autoconf, automake, libtool, bison, flex, gettext); (2) computes the transitive closure via BFS over the recipe TOML dep graph, including both [build].dependencies and [build].dev_dependencies; (3) deletes target/<arch>/{build,sysroot,stage.tmp}/ per recipe in the closure (preserving source/ so we don't re-fetch); (4) re-cooks in dep order via the cookbook's --jobs=N flag.

    21 unit tests in local/scripts/tests/test_scratch_rebuild.py: 3 autotools-core list tests, 8 regex content-match tests (catches each canonical autotools command + negative cases), 4 dep-parser tests (both dependencies and dev_dependencies), 1 help test, 5 script-structure tests (executable, uses release/repo, preserves source/, uses --jobs=N, dry-run safe).

    Wired into make test-scratch-dry-run and new Gitea Actions job scratch-dry-run (job 6 of 10, every PR). Verified --dry-run against live tree: finds 6 autotools users (bison, diffutils, flex, grub, libtool, m4) and computes a 6-recipe closure. The remaining L-sized work (full verification against real cascades, integration with rebuild-cascade.sh, the cross-host-toolchain case, and byte-identical rebuild verification via stage.pkgar hash diffing) is left for a separate session.
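Steps (1) and (2) of the scratch-rebuild script can be sketched in a few lines of Python. This is an illustration, not the script's code: the function names are hypothetical, the dot in `./configure` is escaped here so it matches literally, and the direction in which the closure walks the dep graph is an assumption, since the document does not state it.

```python
import re
from collections import deque

# Same alternatives as the content regex quoted above.
AUTOTOOLS_RE = re.compile(
    r"aclocal|autoreconf|libtoolize|automake|autoconf|gettextize|\./configure"
)


def uses_autotools(build_script):
    """True if a recipe's build script invokes an autotools command."""
    return AUTOTOOLS_RE.search(build_script) is not None


def closure(seeds, graph):
    """Breadth-first transitive closure of `seeds` over `graph`.

    graph maps a recipe to its neighbours (its combined dependencies and
    dev_dependencies, under the assumption stated in the lead-in).
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen
```

The resulting set is what steps (3) and (4) would then delete and re-cook.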

Recommended next step: #7A, the one item that remains open.