RedBear-OS/local/docs/BUILD-SYSTEM-ROBUSTNESS-PLAN.md
Red Bear CI 7177a263bf docs: add BUILD-SYSTEM-ROBUSTNESS-PLAN.md
Comprehensive 6-tier plan to address the 1.5h full-rebuild pathology
when making small config changes. Covers content-hash output
fingerprinting, per-crate granularity, public API surface tracking,
restat / equivalence caching, and developer-experience tools.

Synthesizes techniques from Nix, Buildroot, Yocto, GN/Ninja, Cargo,
and Bazel adapted to Red Bear OS's Rust cookbook.

Triggered by: 2-line edit to local/sources/base/Cargo.toml caused
1.5h full rebuild of redbear-mini. Root cause: cookbook tracks at
recipe granularity (one stage.pkgar for 45-member Cargo workspace)
instead of crate granularity.
2026-06-08 16:03:27 +03:00


RED BEAR OS — BUILD SYSTEM ROBUSTNESS PLAN

Generated: 2026-06-08

Trigger: A 2-line config change (local/sources/base/Cargo.toml: added a [patch.crates-io] entry, changed one path dep to absolute) caused a full 1.5-hour rebuild of the entire OS image. That is not normal. The build system must be made of independent packages with surgical rebuild semantics, and the cookbook must distinguish "source changed" from "no actual output change".

THE CORE PROBLEM

Red Bear OS's cookbook treats a Cargo workspace as a single recipe:

  • base is one recipe (recipes/core/base/recipe.toml) but contains a 45-member Cargo workspace (local/sources/base/Cargo.toml with members = ["audiod", ..., "drivers/pcid", ..., "drivers/graphics/driver-graphics", ...]).
  • A 1-line change to local/sources/base/Cargo.toml invalidates the recipe (modified_dir_ignore_git walks the entire source tree).
  • Cargo recompiles all 45 workspace members because the workspace config changed.
  • The recipe then stages all 45 binaries into one stage.pkgar.
  • Every package that lists base in its [build] dependencies sees a newer .pkgar mtime and rebuilds.
  • Result: a 2-line change rebuilds the entire OS.

This violates the "red bear custom work survives changes" principle. We need surgical rebuild semantics: a change to a single driver should rebuild only that driver, not 45 others, and the only downstream rebuilds should be packages that actually consume the changed driver's public output.

WHAT MATURE SYSTEMS DO

Synthesis of Nix, Buildroot, Yocto, Chromium GN/Ninja, Cargo, and Bazel:

| System | Granularity | Cache key | Cascade behavior |
|---|---|---|---|
| Nix | Per-derivation | Hash of all inputs (input-addressed by default; content-addressed derivations are optional) | Only downstream whose input hash changed rebuilds. Quotient hashing avoids mass rebuilds when fixed inputs change |
| Buildroot | Per-package (stamp file) | Stamp mtime | Manual: user must know when to cascade |
| Yocto | Per-task (siginfo) | Hash of all recipe variables | sstate cache; equivalence server avoids redundant rebuilds |
| GN/Ninja | Per-target (explicit deps) | mtime + restat + gn analyze | gn analyze prunes tree to affected targets; public_deps distinguish API vs implementation |
| Cargo | Per-unit (fingerprint) | Hash of rustc version + features + target + profile + dep fingerprints | Only units with changed fingerprints rebuild; dep-fingerprint cascade |
| Bazel | Per-action (declared inputs) | Hash of action inputs + command line + env | Skyframe does reverse-transitive-closure; "resurrection" reverts if rebuild produces identical output |

The four core techniques Red Bear OS is missing:

  1. Content-addressed outputs (Nix, Bazel) — store by hash, not by name
  2. Per-unit fingerprints with dep cascade (Cargo) — only rebuild units whose fingerprint changes
  3. Public vs private API boundary (GN) — only propagate dirty when public surface changes
  4. Restat / equivalence caching (Ninja, Yocto): if the rebuilt output is byte-identical to the previous one, clear the dirty flag so dependents are not rebuilt

TIER 1 — IMMEDIATE WINS (low effort, high impact)

T1.1 — Content-hash stage.pkgar to detect "no actual change"

Problem: When base rebuilds, it produces a new stage.pkgar (different mtime), even if the pkgar content is byte-identical to the previous one. Downstream sees the mtime change and rebuilds.

Fix: After rebuild, compute a content hash of the new stage.pkgar. If it matches the previous hash, do not bump the mtime (or set the new pkgar's mtime to the old one). Downstream mtime comparison will see no change → no cascade.

Implementation (cookbook, src/cook/cook_build.rs):

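// Before packaging: remember the previous pkgar's mtime, if one exists.
// (Packaging overwrites stage.pkgar, so the old mtime is gone afterwards.)
let old_mtime = std::fs::metadata(&stage_pkgar).and_then(|m| m.modified()).ok();

// ... package stage → stage.pkgar ...

// After packaging: compare content hashes.
let new_hash = blake3::hash(&std::fs::read(&stage_pkgar)?).to_hex().to_string();
let hash_path = stage_dir.join("stage.pkgar.hash");
let unchanged = std::fs::read_to_string(&hash_path)
    .map(|old| old.trim() == new_hash)
    .unwrap_or(false);
if let (true, Some(mtime)) = (unchanged, old_mtime) {
    // Content unchanged: restore the old mtime so dependents see no change.
    // File::set_modified requires Rust 1.75 or newer.
    std::fs::File::options().write(true).open(&stage_pkgar)?.set_modified(mtime)?;
    return Ok(());
}
std::fs::write(&hash_path, &new_hash)?;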

Impact: A config change that doesn't affect output (e.g., adding a comment, reordering members) will no longer cascade.

Effort: 1 day (cookbook + recipe-side blake3 dep if not present).

T1.2 — repo cook --since=<git-ref> incremental mode

Problem: repo cook rebuilds everything that's "dirty" by mtime. For a developer iterating on a single file, this can be over-inclusive.

Fix: Add --since=<ref> flag that uses git diff --name-only <ref>..HEAD to find changed files, then walks the reverse dep graph to find affected recipes.

Implementation: New src/cook/cook_incremental.rs:

// 1. git diff --name-only <ref>..HEAD → list of changed files
// 2. For each changed file, find recipes whose source contains it
// 3. Build reverse dep graph (BFS)
// 4. Build root-first, then dependents

Impact: A 1-line change in one file rebuilds only that file's recipe + cascade, not the whole source-modified set.

Effort: 3-5 days (git plumbing + BFS + integration with build-redbear.sh).
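
The four steps above could be sketched roughly as follows. This is a minimal illustration, not the cookbook's actual code: the Recipe struct and the affected_recipes function are hypothetical names, the changed-file list is assumed to come from git diff --name-only, and recipes are matched to files by source-directory prefix.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Hypothetical recipe description: where its source lives and what it depends on.
pub struct Recipe {
    pub name: String,
    pub source_dir: String, // path prefix, e.g. "local/sources/base"
    pub deps: Vec<String>,  // names of recipes this one depends on
}

/// Returns affected recipes in BFS order: directly changed recipes first,
/// then their dependents, level by level.
pub fn affected_recipes(recipes: &[Recipe], changed_files: &[&str]) -> Vec<String> {
    // Step 2: recipes whose source directory contains a changed file.
    let mut queue: VecDeque<String> = VecDeque::new();
    let mut seen: BTreeSet<String> = BTreeSet::new();
    for r in recipes {
        let prefix = format!("{}/", r.source_dir.trim_end_matches('/'));
        if changed_files.iter().any(|f| f.starts_with(&prefix)) && seen.insert(r.name.clone()) {
            queue.push_back(r.name.clone());
        }
    }
    // Step 3: reverse dependency edges (dependency -> dependents).
    let mut rdeps: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for r in recipes {
        for d in &r.deps {
            rdeps.entry(d.as_str()).or_default().push(r.name.as_str());
        }
    }
    // Step 4: breadth-first walk over the reverse edges.
    let mut order = Vec::new();
    while let Some(name) = queue.pop_front() {
        for dependent in rdeps.get(name.as_str()).into_iter().flatten() {
            if seen.insert(dependent.to_string()) {
                queue.push_back(dependent.to_string());
            }
        }
        order.push(name);
    }
    order
}
```

BFS order is not always a valid build order when a recipe is reachable at several depths, so a real implementation would likely sort the affected set topologically before building.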

T1.3 — Fix cascade script to use Cargo workspace member detection

Problem: local/scripts/rebuild-cascade.sh uses text grep (grep -q "dependencies.*=.*\[.*${target}.*\]") which misses:

  • Cargo workspace member-to-member dependencies (e.g., pcid and pcid-spawner in same workspace)
  • dev-dependencies
  • Conditional dependencies behind features

Fix: Augment with Cargo workspace member parsing. For each recipe, if it's a Cargo recipe, parse the members key of the [workspace] table in Cargo.toml and add member-to-member edges.

Implementation (rebuild-cascade.sh, augment with cargo-aware pass):

# After text-grep pass, add cargo workspace members
for recipe_toml in $(find recipes/ local/recipes/ -name "recipe.toml"); do
    source_dir=$(toml_get "$recipe_toml" source.path)
    if [ -f "$source_dir/Cargo.toml" ]; then
        # Parse workspace members. This grep only matches arrays written with
        # one quoted member per line; an inline members = ["a", "b"] array is
        # missed. `cargo metadata --no-deps --format-version 1` is a more
        # robust source for the member list.
        members=$(grep -A100 '^\[workspace\]' "$source_dir/Cargo.toml" | \
                  grep -E '^\s*"[^"]+",?\s*$' | tr -d '",' | xargs)
        for member in $members; do
            # Each member is a potential dependent if it has Cargo deps on the target
            ...
        done
    fi
done

Impact: Cascade detection is accurate; no missed rebuilds, no false rebuilds.

Effort: 2-3 days.

T1.4 — Per-source-hash invalidation in modified_dir_ignore_git

Problem: fs.rs:160-167 walks the ENTIRE source tree to find the newest file mtime. A single .swp file or build artifact can invalidate the cache.

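Fix: Use git tree hash for source modification detection. If the source is a git repo (most local sources are), use git rev-parse HEAD:./path to get a content hash of the committed tree. Because HEAD reflects only committed state, uncommitted edits to tracked files must be folded in as well (for example via git diff HEAD); otherwise a developer's working-tree change would not trigger a rebuild. Mark dirty only when the combined hash changes.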

Implementation (fs.rs):

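pub fn source_fingerprint(dir: &Path) -> Result<String> {
    if is_git_repo(dir) {
        let mut hasher = blake3::Hasher::new();
        // `ls-tree -r HEAD` lists the committed blob hashes under `dir`.
        // `diff HEAD -- .` adds uncommitted edits to tracked files, which
        // HEAD alone would miss. Untracked files (.swp, target/) are ignored,
        // which also means a new source file is invisible until `git add`.
        let commands: [&[&str]; 2] = [&["ls-tree", "-r", "HEAD"], &["diff", "HEAD", "--", "."]];
        for args in commands {
            let output = Command::new("git").arg("-C").arg(dir).args(args).output()?;
            // A non-zero exit status should be turned into an error here.
            hasher.update(&output.stdout);
        }
        Ok(hasher.finalize().to_hex().to_string())
    } else {
        // Fallback: hash the contents of all files. `hash_dir_contents` is a
        // helper to be written; the blake3 crate has no directory-hashing call.
        hash_dir_contents(dir)
    }
}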

Impact: Untracked files (.swp, target/, Cargo.lock if not tracked) no longer trigger rebuilds, and neither do mtime-only changes such as those caused by touch.

Effort: 2 days.

TIER 2 — PER-CRATE GRANULARITY (medium effort, very high impact)

T2.1 — Split base workspace into per-binary sub-recipes

Problem: base is one recipe with 45 workspace members. A 1-line change rebuilds all 45.

Fix: Two options:

Option A (simpler): Keep base as a Cargo workspace, but change the cookbook's build() to track per-binary stage.pkgar files. Each binary gets its own pkgar; downstream depends on specific binaries.

Option B (cleaner): Split base into one recipe per binary. Each recipe:

  • Has its own recipe.toml with template = "cargo" and a single -p filter
  • Stages its own binary
  • Other packages depend on specific binaries (e.g., pcid-bin, usbhidd-bin)

Recommended: Option A first (smaller diff), then migrate to Option B as cleanup.

Implementation (cookbook, cook_build.rs):

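// Detect workspace members (paths such as "drivers/pcid")
let members = parse_workspace_members(&source_manifest)?;
for member in members {
    // Use the last path component for file and package names, so that
    // "drivers/pcid" maps to pcid.pkgar and `-p pcid`. This assumes the
    // package name matches its directory name.
    let name = member.rsplit('/').next().unwrap_or(&member);
    let member_pkgar = stage_dir.join(format!("{name}.pkgar"));
    // Per-member mtime + per-member hash
    let member_source_dir = source_dir.join(&member);
    let member_modified = modified_dir_ignore_git(&member_source_dir)?;
    let member_deps = member_dependencies(&member, &source_manifest)?;
    // A missing pkgar counts as out of date.
    let member_pkgar_modified = std::fs::metadata(&member_pkgar).and_then(|m| m.modified()).ok();
    // Per-member cache check. Edits to the workspace-level Cargo.toml or
    // Cargo.lock affect every member and need a separate check.
    let stale = member_pkgar_modified.map_or(true, |built| {
        built < member_modified || built < deps_modified_for(&member_deps)
    });
    if stale {
        // Rebuild this member only (plus the recipe's usual cargo flags)
        Command::new("cargo").args(["build", "-p", name]).status()?;
    }
}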

Impact: A change to audiod rebuilds only audiod, not all 45 base binaries. A change to pcid rebuilds only pcid. This is the single biggest win.

Effort: 1-2 weeks (cookbook refactor + recipe updates).

T2.2 — Restructure kernel and mesa recipes similarly

Same pattern as T2.1 for:

  • kernel recipe (single kernel.elf output, but multiple internal stages)
  • mesa recipe (single libGL.so etc., but multiple internal sub-libraries)
  • llvm21 recipe (single clang binary, but many internal components)

Impact: A change to one mesa component rebuilds only that component, not the whole mesa build (which takes 20+ minutes).

Effort: 1 week each.

T2.3 — Per-binfmt_pkg output tracking

Problem: A recipe's stage/ directory contains many files, all bundled into one stage.pkgar. The cookbook doesn't know which file in stage/ corresponds to which binary.

Fix: Add installs = [...] to recipes (already partially supported), and use it to track per-output mtime.

Implementation: When a recipe declares installs = ["/usr/bin/foo", "/usr/lib/libbar.so"], the cookbook:

  • Tracks mtime per output path
  • Computes per-output hash
  • Lets downstream depend on specific output paths

Impact: When libdrm.so changes but libdrm_intel.so doesn't, only consumers of libdrm.so rebuild.

Effort: 1-2 weeks.
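
As a rough illustration of per-output tracking, the sketch below compares two maps of output path to content hash and decides whether a consumer needs rebuilding. The type and function names (OutputHashes, changed_outputs, must_rebuild) are hypothetical, and the hashes are assumed to be hex strings computed elsewhere.

```rust
use std::collections::BTreeMap;

/// Output path -> content hash (hex string), one entry per declared `installs` path.
pub type OutputHashes = BTreeMap<String, String>;

/// Returns the output paths whose hash differs between two builds,
/// including paths that were added or removed.
pub fn changed_outputs(old: &OutputHashes, new: &OutputHashes) -> Vec<String> {
    let mut changed: Vec<String> = new
        .iter()
        .filter(|(path, hash)| old.get(*path) != Some(*hash))
        .map(|(path, _)| path.clone())
        .collect();
    changed.extend(old.keys().filter(|p| !new.contains_key(*p)).cloned());
    changed.sort();
    changed
}

/// A downstream recipe must rebuild only if it consumes a changed output.
pub fn must_rebuild(consumed: &[&str], changed: &[String]) -> bool {
    consumed.iter().any(|c| changed.iter().any(|p| p == c))
}
```

With this shape, a consumer that lists only /usr/lib/libdrm_intel.so is left alone when only /usr/lib/libdrm.so changes.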

TIER 3 — OUTPUT FINGERPRINTING (medium-high effort, very high impact)

T3.1 — Hash the sysroot content of each recipe

Problem: Currently the cookbook only checks mtime of stage.pkgar, not its content. Two builds that produce identical pkgar content still cascade downstream.

Fix: Compute BLAKE3 hash of the staged sysroot artifacts; cache it; use it as part of the package fingerprint.

Implementation (src/cook/cook_build.rs):

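// After staging files into stage/
let stage_fingerprint = compute_stage_fingerprint(&stage_dir)?;
// Keep the fingerprint file next to stage/ rather than inside it, so it is
// neither packaged nor included in the next fingerprint.
let fp_file = stage_dir.with_extension("fingerprint");
let new_fp = blake3::hash(stage_fingerprint.as_bytes()).to_hex().to_string();

match std::fs::read_to_string(&fp_file) {
    // Stage contents identical: preserve mtime, leave the fingerprint as is
    Ok(old_fp) if old_fp.trim() == new_fp => preserve_mtime_recursive(&stage_dir)?,
    // First build or changed contents: record the new fingerprint
    _ => std::fs::write(&fp_file, &new_fp)?,
}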

Impact: A rebuild that produces identical output (e.g., due to deterministic compiler output for unchanged sources) doesn't cascade.

Effort: 3-5 days.

T3.2 — Cascade invalidation only when downstream input hash changes

Problem: Cascade currently triggers on mtime. Mtime can change without content change.

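Fix: Instead of comparing mtimes (stage_modified < deps_modified), compare hashes: rebuild a downstream recipe only when the fingerprint of its declared inputs differs from the fingerprint recorded at its last build. Hashes have no ordering, so the test is inequality rather than less-than.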

Where:

  • Each recipe declares a fingerprint_inputs = [...] list (paths it consumes)
  • On each rebuild, hash the contents of those paths
  • Store the hash as part of the recipe's fingerprint

Implementation:

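# Recipe declares (recipe.toml):
[package]
fingerprint_inputs = ["/usr/lib/libdrm.so", "/usr/include/libdrm/drm.h"]

// Cookbook computes (Rust). hash_paths is a helper to be written that hashes
// the contents of each listed path in order; the blake3 crate itself has no
// directory or multi-path hashing function.
let input_fingerprint = hash_paths(&fingerprint_inputs)?;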

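Impact: When the bytes of libdrm.so don't change (e.g., the upstream edit touched only comments or files that don't feed into the library), consumers don't rebuild. An implementation change does alter the bytes; suppressing that cascade is the job of Tier 4.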

Effort: 1-2 weeks.

T3.3 — Yocto-style equivalence cache for ABI-stable rebuilds

Problem: When a recipe's source changes but its output is byte-identical, the recipe rebuilds but downstream should not.

Fix: Implement an "equivalence cache" — a database mapping old content hash → new content hash for ABI-equivalent outputs. When the new content hash matches an old one (within the equivalence class), downstream is not invalidated.

Implementation: SQLite-backed equivalence cache at .redbear/equivalence.db. Keyed by input hash + build flags; value is the set of "equivalent" output hashes.

Impact: Even non-deterministic builds (e.g., embedded timestamps) can be marked equivalent.

Effort: 2-3 weeks.
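
One possible shape for the cache, following the keying described above (input hash + build flags mapping to a set of output hashes), is sketched below. It uses an in-memory HashMap as a stand-in for the proposed SQLite table, and the EquivalenceCache type and its methods are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// In-memory stand-in for the proposed SQLite table.
#[derive(Default)]
pub struct EquivalenceCache {
    // (input hash, build flags) -> output hashes observed for that key
    classes: HashMap<(String, String), HashSet<String>>,
}

impl EquivalenceCache {
    /// Records that building `input_hash` with `flags` produced `output_hash`.
    pub fn record(&mut self, input_hash: &str, flags: &str, output_hash: &str) {
        self.classes
            .entry((input_hash.to_string(), flags.to_string()))
            .or_default()
            .insert(output_hash.to_string());
    }

    /// True if both output hashes were recorded under the same key, i.e.
    /// they are treated as interchangeable for downstream recipes.
    pub fn equivalent(&self, input_hash: &str, flags: &str, a: &str, b: &str) -> bool {
        self.classes
            .get(&(input_hash.to_string(), flags.to_string()))
            .map_or(false, |set| set.contains(a) && set.contains(b))
    }
}
```

Treating every output of the same inputs as equivalent assumes the differences are benign (such as embedded timestamps); that assumption would need validating per recipe before downstream rebuilds are skipped.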

TIER 4 — PUBLIC API TRACKING (high effort, high impact for kernel/relibc)

T4.1 — Distinguish public headers from internal sources

Problem: relibc changes cascade to all C/C++ packages. But only changes to relibc's public headers (in local/sources/relibc/include/) should cascade. Internal changes to local/sources/relibc/src/ should not.

Fix: Each recipe declares public_api = [...] — the list of paths that constitute its public API. Only mtime/hash changes to those paths trigger cascade.

Implementation:

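let public_api = recipe.public_api_paths();
// Newest mtime across the declared public API paths (None if the list is
// empty). Assumes modified() returns Result<SystemTime>.
let public_api_modified = public_api.iter()
    .map(|p| modified(p))
    .collect::<Result<Vec<_>>>()?
    .into_iter()
    .max();

// Cascade only when a public path is newer than the staged output.
if public_api_modified.is_some_and(|m| stage_modified < m) {
    cascade_to_dependents();
}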

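Impact: A change to relibc/src/header/errno/mod.rs (internal) doesn't cascade. A change to relibc/include/sys/errno.h (public) does. Because relibc generates many of its C headers from src/header/ with cbindgen, the comparison should be made on the generated headers in the staged output rather than on source paths alone.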

Effort: 1-2 weeks per "API surface" (relibc, kernel, mesa).

T4.2 — Track ABI via .so version files / SONAME bumps

Problem: Even if headers change, if the ABI version is unchanged, downstream can use the new library without recompilation.

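Fix: Parse .so files for SONAME. Compare SONAME between old and new build. If SONAME is unchanged, dynamically linked consumers are not rebuilt. This is only safe if the SONAME is bumped on every ABI break, and it does not cover consumers that link the library statically or depend on changed macros or inline functions in its headers.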

Implementation: ELF SONAME parser in cookbook. SONAME = readelf -d *.so | grep SONAME.

Impact: Relibc ABI-preserving changes don't cascade to C/C++ packages.

Effort: 1 week.

T4.3 — Header dependency graph for C/C++ packages

Problem: For a C/C++ package, the cascade trigger is currently "any header in any include path", which is over-inclusive. It should be "the headers this package's source files actually include".

Fix: Use gcc -M / clang -M to generate per-file header dependencies. Hash the resulting .d file. Cascade only when those specific headers change.

Impact: A change to errno.h doesn't cascade to packages that don't include errno.h.

Effort: 1-2 weeks.
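
To make the .d-file step concrete, here is a small sketch that extracts header paths from Makefile-style dependency output of the kind gcc -M and clang -M emit. The function name headers_from_depfile is hypothetical, and the parser deliberately ignores edge cases such as escaped spaces in paths.

```rust
/// Parses Makefile-style dependency output (a target, a colon, then a
/// whitespace-separated prerequisite list with backslash line continuations)
/// and returns the header prerequisites (paths ending in .h or .hpp).
pub fn headers_from_depfile(depfile: &str) -> Vec<String> {
    // Join backslash-continued lines into one logical line.
    let joined = depfile.replace("\\\n", " ");
    let mut headers = Vec::new();
    for line in joined.lines() {
        // Everything after the first ':' is the prerequisite list.
        let Some((_target, deps)) = line.split_once(':') else { continue };
        for dep in deps.split_whitespace() {
            if dep.ends_with(".h") || dep.ends_with(".hpp") {
                headers.push(dep.to_string());
            }
        }
    }
    headers.sort();
    headers.dedup();
    headers
}
```

The resulting list is what would be hashed per package. Note that -M includes system headers while -MM omits them, so the choice of flag decides whether relibc headers appear in the list at all.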

TIER 5 — RESTAT / OUTPUT STABILITY (medium effort, medium impact)

T5.1 — After rebuild, check if installed files differ from previous

Problem: Currently, every rebuild produces a new stage.pkgar regardless of whether content changed.

Fix: After cargo build and before pkgar packaging, diff the new sysroot against the old sysroot. If all files are byte-identical, copy old stage.pkgar mtime to new files.

Implementation: diff -r or content-hash comparison.

Impact: Idempotent builds don't cascade.

Effort: 3-5 days.
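
The content comparison could look roughly like the sketch below, which uses only the standard library. The snapshot and trees_identical names are hypothetical, and the approach reads whole files into memory for simplicity; a real implementation would more likely compare streamed hashes and also consider permissions and symlinks.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collects relative path -> file bytes for every file under `root`.
fn snapshot(root: &Path) -> io::Result<BTreeMap<PathBuf, Vec<u8>>> {
    let mut files = BTreeMap::new();
    let mut stack = vec![root.to_path_buf()];
    while let Some(dir) = stack.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                stack.push(path);
            } else {
                let rel = path
                    .strip_prefix(root)
                    .expect("entry is under root")
                    .to_path_buf();
                files.insert(rel, fs::read(&path)?);
            }
        }
    }
    Ok(files)
}

/// True when both trees contain the same relative paths with identical bytes.
pub fn trees_identical(old: &Path, new: &Path) -> io::Result<bool> {
    Ok(snapshot(old)? == snapshot(new)?)
}
```

When trees_identical returns true, the cookbook can skip repackaging or restore the previous stage.pkgar mtime.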

T5.2 — Idempotent packaging

Problem: pkgar files include timestamps, so identical content produces different pkgar files.

Fix: Make pkgar packaging deterministic (sort entries, zero timestamps, fixed compression).

Impact: Identical content → identical pkgar → no cascade.

Effort: 1 week (upstream pkgar changes).

TIER 6 — DEPENDENCY GRAPH ANALYSIS (low effort, medium impact)

T6.1 — Add repo graph to show full dependency graph

Problem: Hard to know what rebuilds when X changes.

Fix: Add repo graph that emits the full dep graph in DOT format. Visualize with xdot or similar.

Implementation: New src/bin/repo_graph.rs (or subcommand in repo.rs).

Effort: 2-3 days.
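
DOT emission itself is small. The sketch below assumes the dependency graph is already available as a map from recipe name to its dependencies; the to_dot function is a hypothetical name, and recipe names are assumed not to contain double quotes.

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

/// Renders `recipe -> [dependencies]` as a Graphviz DOT digraph.
pub fn to_dot(deps: &BTreeMap<String, Vec<String>>) -> String {
    let mut out = String::from("digraph recipes {\n");
    for (recipe, recipe_deps) in deps {
        // Emit the node even if it has no edges.
        writeln!(out, "    \"{recipe}\";").unwrap();
        for dep in recipe_deps {
            writeln!(out, "    \"{recipe}\" -> \"{dep}\";").unwrap();
        }
    }
    out.push_str("}\n");
    out
}
```

Using a BTreeMap keeps the output order stable between runs, which makes the generated file easy to diff.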

T6.2 — Add repo cook --since=<commit> to only rebuild affected packages

Problem: When you git pull or merge a branch, you want to rebuild only what the merge touched.

Fix: Use git diff --name-only between old HEAD and new HEAD, walk reverse deps.

Implementation: New --since flag in repo cook. Falls back to --changed for tracked files.

Effort: 3-5 days.

T6.3 — Add repo why <pkg> to show what triggers rebuilds

Problem: When pkg rebuilds, why? What cascaded into it?

Fix: Reverse-dep analysis — show the path from each changed source to the target recipe.

Implementation: BFS from changed source paths, through recipe deps and Cargo workspace members, to target recipe.

Effort: 2-3 days.
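
The path-finding part of this could be sketched as a BFS with parent links, as below. The why function and the rdeps map (recipe to the recipes that depend on it) are hypothetical; mapping changed source paths to their owning recipe is assumed to have happened already.

```rust
use std::collections::{BTreeMap, VecDeque};

/// `rdeps` maps a recipe to the recipes that depend on it.
/// Returns the shortest chain from `changed` to `target`, if one exists.
pub fn why(
    rdeps: &BTreeMap<String, Vec<String>>,
    changed: &str,
    target: &str,
) -> Option<Vec<String>> {
    // For each visited recipe, remember which recipe led to it.
    let mut parent: BTreeMap<String, Option<String>> = BTreeMap::new();
    parent.insert(changed.to_string(), None);
    let mut queue = VecDeque::from([changed.to_string()]);
    while let Some(node) = queue.pop_front() {
        if node == target {
            // Walk parent links back to the start, then reverse.
            let mut path = vec![node.clone()];
            let mut cur = &node;
            while let Some(Some(p)) = parent.get(cur) {
                path.push(p.clone());
                cur = p;
            }
            path.reverse();
            return Some(path);
        }
        for next in rdeps.get(&node).into_iter().flatten() {
            if !parent.contains_key(next) {
                parent.insert(next.clone(), Some(node.clone()));
                queue.push_back(next.clone());
            }
        }
    }
    None
}
```

Printing the returned chain joined with arrows gives the developer the cascade path from the changed recipe to the target.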

PRIORITIZED ROADMAP

| Item | Effort | Impact | Risk | Priority |
|---|---|---|---|---|
| T1.1 Content-hash stage.pkgar | 1 day | High (catches all no-op rebuilds) | Low | P0 (do first) |
| T1.4 Per-source-hash via git tree | 2 days | High (eliminates spurious dirty) | Low | P0 |
| T1.2 --since flag | 3-5 days | High (developer workflow) | Medium | P1 |
| T1.3 Cascade script cargo-aware | 2-3 days | Medium | Low | P1 |
| T2.1 Split base per-binary | 1-2 weeks | Very high (45 → 1 rebuild) | Medium (breaking) | P1 |
| T3.1 Sysroot fingerprint | 3-5 days | High | Low | P1 |
| T2.2 Split kernel/mesa | 1 week each | High | Medium | P2 |
| T3.2 Downstream-input hash | 1-2 weeks | Very high | Medium | P2 |
| T6.1 repo graph | 2-3 days | Medium (devx) | Low | P2 |
| T6.2 --since commit | 3-5 days | High (devx) | Low | P2 |
| T5.1 Restat diff | 3-5 days | Medium | Low | P3 |
| T3.3 Equivalence cache | 2-3 weeks | High | Medium (cache coherency) | P3 |
| T4.1 Public API surface | 1-2 weeks | High (relibc) | Medium (semantics) | P3 |
| T4.2 SONAME tracking | 1 week | Medium | Low | P4 |
| T4.3 Header dep graph | 1-2 weeks | Medium | Medium | P4 |
| T5.2 Idempotent pkgar | 1 week | Medium | Medium | P4 |
| T2.3 Per-binfmt_pkg | 1-2 weeks | High | Medium | P4 |
| T6.3 repo why | 2-3 days | Low (devx) | Low | P4 |

PHASED IMPLEMENTATION

Phase A (1-2 weeks) — Stop the bleeding

  • T1.1 content-hash stage.pkgar
  • T1.4 git-tree source fingerprint
  • T1.3 cargo-aware cascade
  • Result: 2-line Cargo.toml change no longer cascades if output is identical

Phase B (2-4 weeks) — Per-crate granularity

  • T2.1 split base workspace (1-2 weeks)
  • T2.2 split kernel/mesa (1 week each)
  • T3.1 sysroot fingerprint
  • T3.2 downstream-input hash cascade
  • Result: A change to one driver rebuilds only that driver

Phase C (2-4 weeks) — API surface tracking

  • T4.1 public API surface for relibc + kernel
  • T4.2 SONAME tracking
  • T4.3 header dep graph
  • T5.1 restat diff
  • T5.2 idempotent pkgar
  • Result: Internal implementation changes don't cascade

Phase D (1-2 weeks) — Developer experience

  • T6.1 repo graph
  • T6.2 --since commit
  • T6.3 repo why
  • T1.2 --since=<ref> incremental
  • Result: Developer can answer "what rebuilds" instantly

METRICS & SUCCESS CRITERIA

The build system is healthy when:

| Metric | Current | Target |
|---|---|---|
| 1-line Cargo.toml change rebuilds | Full OS (1.5h) | < 5 min (only changed recipe) |
| make after no source change | Full OS (1.5h) | 0 sec (idempotent, no-op) |
| 1-line kernel source change | Full OS (1.5h) | < 10 min (kernel + kernel consumers) |
| 1-line relibc internal change | Full OS (1.5h) | < 5 min (relibc + 0 consumers if API unchanged) |
| repo cook --since=v0.1.0 | Full OS | < 1 min (1-2 packages) |
| repo why mesa | N/A | < 1 sec (printed graph) |

DESIGN CONSTRAINTS

These constraints are non-negotiable:

  1. Offline-first: REPO_OFFLINE=1 must remain the default. All changes must work without network access.
  2. Determinism — Outputs must be byte-identical for identical inputs (modulo timestamps).
  3. Backward compat — Existing recipes must continue to work without modification.
  4. No new build dependencies — Only use crates already in the workspace.
  5. Performance — Fingerprint computation must be O(source) or O(staged-output), not O(n²).
  6. Durability — Fingerprint caches must survive make distclean (in local/.cache/).

NON-GOALS

We will NOT:

  • Replace Cargo (too invasive, too risky)
  • Migrate to Bazel or Nix (would require months of work)
  • Add remote artifact caching (out of scope; we have local sstate)
  • Rewrite the build in a different language
  • Add a distributed build cluster

WHY THIS PLAN WILL WORK

Mature systems (Nix, Cargo, Yocto) already implement these patterns. The techniques are proven. Red Bear OS only needs to add output fingerprinting and per-crate granularity, both of which are well-understood in the broader build systems literature.

The hardest part is T2.1 (per-crate granularity for base) because it requires cookbook changes. But the rest can be implemented incrementally and tested with --no-cache for correctness.

NEXT STEPS

  1. Implement T1.1 (content-hash stage.pkgar) — 1 day, low risk
  2. Implement T1.4 (git-tree source fingerprint) — 2 days, low risk
  3. Implement T1.3 (cargo-aware cascade) — 2-3 days, low risk
  4. Test: 1-line Cargo.toml change should rebuild only base
  5. Implement T2.1 (per-binary base) — 1-2 weeks
  6. Test: 1-line pcid source change should rebuild only pcid
  7. Implement T3.1, T3.2 (output fingerprinting + cascade-by-hash)
  8. Test: rebuild with identical source produces no cascade
  9. Phase D devx improvements (graph, why, since)

REFERENCES