RedBear-OS/local/docs/BUILD-SYSTEM-ROBUSTNESS-PLAN.md
Red Bear CI 7177a263bf docs: add BUILD-SYSTEM-ROBUSTNESS-PLAN.md
Comprehensive 6-tier plan to address the 1.5h full-rebuild pathology
when making small config changes. Covers content-hash output
fingerprinting, per-crate granularity, public API surface tracking,
restat / equivalence caching, and developer-experience tools.

Synthesizes techniques from Nix, Buildroot, Yocto, GN/Ninja, Cargo,
and Bazel adapted to Red Bear OS's Rust cookbook.

Triggered by: 2-line edit to local/sources/base/Cargo.toml caused
1.5h full rebuild of redbear-mini. Root cause: cookbook tracks at
recipe granularity (one stage.pkgar for 45-member Cargo workspace)
instead of crate granularity.
2026-06-08 16:03:27 +03:00

# RED BEAR OS — BUILD SYSTEM ROBUSTNESS PLAN
**Generated**: 2026-06-08
**Trigger**: A 2-line config change (`local/sources/base/Cargo.toml` — added `[patch.crates-io]`
entry, changed one path dep to absolute) caused a full 1.5-hour rebuild of the entire OS image.
That is not normal. The build system must be made of independent packages with surgical
rebuild semantics, and the cookbook must distinguish "source changed" from "no actual output change".
## THE CORE PROBLEM
Red Bear OS's cookbook treats a Cargo workspace as a single recipe:
- `base` is **one** recipe (`recipes/core/base/recipe.toml`) but contains a **45-member Cargo
workspace** (`local/sources/base/Cargo.toml` with `members = ["audiod", ..., "drivers/pcid",
..., "drivers/graphics/driver-graphics", ...]`).
- A 1-line change to `local/sources/base/Cargo.toml` invalidates the recipe (`modified_dir_ignore_git`
walks the entire source tree).
- Cargo recompiles all 45 workspace members because the workspace config changed.
- The recipe then stages all 45 binaries into one `stage.pkgar`.
- Every package that lists `base` in its `[build] dependencies` sees a newer `.pkgar` mtime and
rebuilds.
- Result: a 2-line change rebuilds the entire OS.
This violates the "red bear custom work survives changes" principle. We need surgical rebuild
semantics: a change to a single driver should rebuild only that driver, not 45 others, and the
only downstream rebuilds should be packages that actually consume the changed driver's public
output.
## WHAT MATURE SYSTEMS DO
Synthesis of Nix, Buildroot, Yocto, Chromium GN/Ninja, Cargo, and Bazel:
| System | Granularity | Cache key | Cascade behavior |
|---|---|---|---|
| **Nix** | Per-derivation | Hash of all inputs (content-addressed) | Only downstream whose input hash changed rebuilds. Quotient hashing avoids mass rebuilds when fixed inputs change |
| **Buildroot** | Per-package (stamp file) | Stamp mtime | Manual — user must know when to cascade |
| **Yocto** | Per-task (siginfo) | Hash of all recipe variables | sstate cache; equivalence server avoids redundant rebuilds |
| **GN/Ninja** | Per-target (explicit `deps`) | mtime + `restat` + `gn analyze` | `gn analyze` prunes tree to affected targets; `public_deps` distinguish API vs implementation |
| **Cargo** | Per-unit (fingerprint) | Hash of rustc version + features + target + profile + dep fingerprints | Only units with changed fingerprints rebuild; dep-fingerprint cascade |
| **Bazel** | Per-action (declared inputs) | Hash of action inputs + command line + env | Skyframe does reverse-transitive-closure; "resurrection" reverts if rebuild produces identical output |
**The four core techniques Red Bear OS is missing:**
1. **Content-addressed outputs** (Nix, Bazel): store by hash, not by name
2. **Per-unit fingerprints with dep cascade** (Cargo): only rebuild units whose fingerprint changes
3. **Public vs private API boundary** (GN): only propagate dirty when public surface changes
4. **Restat / equivalence caching** (Ninja, Yocto): if the rebuilt output is byte-identical, leave dependents clean instead of marking them dirty
## TIER 1 — IMMEDIATE WINS (low effort, high impact)
### T1.1 — Content-hash `stage.pkgar` to detect "no actual change"
**Problem**: When `base` rebuilds, it produces a new `stage.pkgar` (different mtime), even if the
pkgar content is byte-identical to the previous one. Downstream sees the mtime change and
rebuilds.
**Fix**: After rebuild, compute a content hash of the new `stage.pkgar`. If it matches the
previous hash, **do not bump the mtime** (or set the new pkgar's mtime to the old one).
Downstream mtime comparison will see no change → no cascade.
**Implementation** (cookbook, `src/cook/cook_build.rs`):
```rust
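// Capture the previous mtime BEFORE packaging overwrites stage.pkgar
let old_mtime = std::fs::metadata(&stage_pkgar).and_then(|m| m.modified()).ok();

// (packaging of stage/ into stage.pkgar happens here)

let new_hash = blake3::hash(&std::fs::read(&stage_pkgar)?).to_hex().to_string();
let old_hash_path = stage_dir.join("stage.pkgar.hash");
if let (Ok(old_hash), Some(mtime)) = (std::fs::read_to_string(&old_hash_path), old_mtime) {
    if old_hash.trim() == new_hash {
        // Content unchanged: restore the old mtime so dependents see no change.
        // `preserve_mtime` is a helper that still has to be written.
        preserve_mtime(&stage_pkgar, mtime)?;
        return Ok(());
    }
}
std::fs::write(&old_hash_path, &new_hash)?;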
```
**Impact**: A config change that doesn't affect output (e.g., adding a comment, reordering
members) will no longer cascade.
**Effort**: 1 day (cookbook + recipe-side `blake3` dep if not present).
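
The T1.1 snippet relies on an mtime-preserving helper that is not shown. A minimal std-only sketch of what it could look like follows; the names `saved_mtime` and `restore_mtime` are hypothetical, and `File::set_modified` requires Rust 1.75 or newer.

```rust
use std::fs::{self, File};
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// Hypothetical helper: read the mtime of an artifact (call this before the
/// artifact is overwritten by a rebuild).
pub fn saved_mtime(path: &Path) -> io::Result<SystemTime> {
    fs::metadata(path)?.modified()
}

/// Hypothetical helper: stamp `path` with a previously saved mtime so that
/// mtime-based dependents see no change.
pub fn restore_mtime(path: &Path, mtime: SystemTime) -> io::Result<()> {
    // The file must be opened with write access for set_modified to succeed.
    let file = File::options().write(true).open(path)?;
    file.set_modified(mtime)
}
```

The intended call order would be `saved_mtime` before packaging, then `restore_mtime` only when the content hash matches the stored one.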
### T1.2 — `repo cook --since=<git-ref>` incremental mode
**Problem**: `repo cook` rebuilds everything that's "dirty" by mtime. For a developer iterating
on a single file, this can be over-inclusive.
**Fix**: Add `--since=<ref>` flag that uses `git diff --name-only <ref>..HEAD` to find changed
files, then walks the reverse dep graph to find affected recipes.
**Implementation**: New `src/cook/cook_incremental.rs`:
```rust
// 1. git diff --name-only <ref>..HEAD → list of changed files
// 2. For each changed file, find recipes whose source contains it
// 3. Build reverse dep graph (BFS)
// 4. Build root-first, then dependents
```
**Impact**: A 1-line change in one file rebuilds only that file's recipe + cascade, not the
whole source-modified set.
**Effort**: 3-5 days (git plumbing + BFS + integration with build-redbear.sh).
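
Steps 2 and 3 of the outline above can be sketched without any git plumbing. This is an illustration under assumptions, not the cookbook's API: `recipes_for_files` and `affected_recipes` are hypothetical names, recipe sources are assumed to be identified by a directory prefix, and the dependency map is assumed to be already loaded.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Map changed file paths to the recipes whose source directory contains them.
/// `sources` maps recipe name -> source directory prefix (assumed layout).
pub fn recipes_for_files(changed: &[&str], sources: &BTreeMap<&str, &str>) -> BTreeSet<String> {
    let mut hit = BTreeSet::new();
    for file in changed {
        for (recipe, dir) in sources {
            let prefix = format!("{}/", dir.trim_end_matches('/'));
            if file.starts_with(&prefix) {
                hit.insert(recipe.to_string());
            }
        }
    }
    hit
}

/// Breadth-first walk over the reverse dependency graph.
/// `deps` maps recipe -> recipes it depends on; the result contains the
/// roots plus every transitive dependent.
pub fn affected_recipes(
    roots: &BTreeSet<String>,
    deps: &BTreeMap<&str, Vec<&str>>,
) -> BTreeSet<String> {
    // Invert the edges: dependency -> dependents
    let mut rev: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (pkg, ds) in deps {
        for d in ds {
            rev.entry(*d).or_default().push(*pkg);
        }
    }
    let mut seen: BTreeSet<String> = roots.clone();
    let mut queue: VecDeque<String> = roots.iter().cloned().collect();
    while let Some(cur) = queue.pop_front() {
        for dependent in rev.get(cur.as_str()).into_iter().flatten() {
            if seen.insert(dependent.to_string()) {
                queue.push_back(dependent.to_string());
            }
        }
    }
    seen
}
```

The returned set is unordered with respect to build order; a topological sort over the same graph would still be needed for step 4.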
### T1.3 — Fix cascade script to use Cargo workspace member detection
**Problem**: `local/scripts/rebuild-cascade.sh` uses text grep
(`grep -q "dependencies.*=.*\[.*${target}.*\]"`) which misses:
- Cargo workspace member-to-member dependencies (e.g., `pcid` and `pcid-spawner` in same workspace)
- `dev-dependencies`
- Conditional dependencies behind features
**Fix**: Augment with Cargo workspace member parsing. For each recipe, if it's a Cargo recipe,
parse the `members` array in the `[workspace]` table of `Cargo.toml` and add member-to-member edges.
**Implementation** (rebuild-cascade.sh, augment with cargo-aware pass):
```bash
# After text-grep pass, add cargo workspace members
for recipe_toml in $(find recipes/ local/recipes/ -name "recipe.toml"); do
    source_dir=$(toml_get "$recipe_toml" source.path)
    if [ -f "$source_dir/Cargo.toml" ]; then
        # Parse workspace members
        members=$(grep -A100 '^\[workspace\]' "$source_dir/Cargo.toml" | \
            grep -E '^\s*"[^"]+",?\s*$' | tr -d '",' | xargs)
        for member in $members; do
            # Each member is a potential dependent if it has Cargo deps on the target
            ...
        done
    fi
done
```
**Impact**: Cascade detection also follows workspace-internal edges, so member-to-member
dependents are no longer missed.
**Effort**: 2-3 days.
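
If the member parsing moves from the shell script into the cookbook, it could look roughly like the sketch below. It is a naive line-based scan written only against std; `workspace_members` is a hypothetical name, glob members such as `drivers/*` are not expanded, and a real implementation would more likely use a TOML parser.

```rust
/// Extract the string entries of `members = [...]` from the `[workspace]`
/// table of a Cargo.toml given as text.
pub fn workspace_members(manifest: &str) -> Vec<String> {
    let mut in_workspace = false;
    let mut in_members = false;
    let mut members = Vec::new();
    for raw in manifest.lines() {
        // Drop trailing comments (assumes no '#' inside member paths)
        let line = raw.split('#').next().unwrap_or("").trim();
        if line.starts_with('[') && !in_members {
            // A new table header starts; only `[workspace]` is of interest
            in_workspace = line == "[workspace]";
            continue;
        }
        if !in_workspace {
            continue;
        }
        let mut rest = line;
        if !in_members {
            match line.strip_prefix("members") {
                Some(r) if r.trim_start().starts_with('=') => {
                    in_members = true;
                    rest = r;
                }
                _ => continue,
            }
        }
        // Every quoted string on the line is one member path
        for (i, part) in rest.split('"').enumerate() {
            if i % 2 == 1 {
                members.push(part.to_string());
            }
        }
        if rest.contains(']') {
            in_members = false;
        }
    }
    members
}
```

Note that the entries are member paths. The Cargo package name of a member may differ from its path, so building dependency edges still requires reading each member's own `Cargo.toml`.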
### T1.4 — Per-source-hash invalidation in `modified_dir_ignore_git`
**Problem**: `fs.rs:160-167` walks the ENTIRE source tree to find the newest file mtime. A
single `.swp` file or build artifact can invalidate the cache.
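**Fix**: Use git tree hash for source modification detection. If the source is a git repo
(most local sources are), use `git rev-parse HEAD:./path` to get a content hash. Only when the
hash changes, mark dirty. A tree hash taken from `HEAD` ignores uncommitted edits, so the
fingerprint must also account for modified tracked files (for example via
`git status --porcelain`); otherwise local edits would not trigger a rebuild until committed.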
**Implementation** (fs.rs):
```rust
use std::path::Path;
use std::process::Command;

pub fn source_fingerprint(dir: &Path) -> Result<String> {
    if is_git_repo(dir) {
        // `git ls-tree -r HEAD` lists the blob hash of every tracked file at HEAD,
        // so untracked files (`.swp`, `target/`) never affect the fingerprint.
        // Caveat: uncommitted edits to tracked files are not reflected here.
        let output = Command::new("git")
            .arg("-C")
            .arg(dir)
            .args(["ls-tree", "-r", "HEAD"])
            .output()?;
        Ok(blake3::hash(&output.stdout).to_hex().to_string())
    } else {
        // Fallback: hash all files. `hash_dir` is a helper that still has to be
        // written; the blake3 crate does not provide one.
        Ok(hash_dir(dir)?.to_hex().to_string())
    }
}
```
**Impact**: Untracked files (editor `.swp` files, `target/`, `Cargo.lock` if not tracked) no
longer trigger rebuilds, and neither do mtime changes caused by `touch`.
**Effort**: 2 days.
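
The non-git fallback needs a directory hash that the plan leaves open. A minimal sketch using only std is shown below. It uses `DefaultHasher` as a stand-in because it needs no extra crate; it is not cryptographic and its output is not guaranteed stable across Rust releases, so a real implementation would use BLAKE3 as proposed above. The names `hash_tree` and `dir_fingerprint` and the skipped directory names are assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::Hasher;
use std::io;
use std::path::Path;

/// Fold every regular file under `dir` into `hasher`, visiting entries in
/// sorted order so the result does not depend on directory iteration order.
fn hash_tree(root: &Path, dir: &Path, hasher: &mut DefaultHasher) -> io::Result<()> {
    let mut entries = fs::read_dir(dir)?.collect::<io::Result<Vec<_>>>()?;
    entries.sort_by_key(|e| e.file_name());
    for entry in entries {
        let name = entry.file_name();
        // Skip VCS metadata and Cargo build output (assumed directory names)
        if name == ".git" || name == "target" {
            continue;
        }
        let path = entry.path();
        let ty = entry.file_type()?;
        if ty.is_dir() {
            hash_tree(root, &path, hasher)?;
        } else if ty.is_file() {
            // Hash the relative path, then the length-prefixed content
            let rel = path.strip_prefix(root).unwrap_or(&path);
            hasher.write(rel.to_string_lossy().as_bytes());
            hasher.write_u8(0);
            let data = fs::read(&path)?;
            hasher.write_u64(data.len() as u64);
            hasher.write(&data);
        }
    }
    Ok(())
}

/// Content fingerprint of a source directory; mtimes are never consulted.
pub fn dir_fingerprint(dir: &Path) -> io::Result<String> {
    let mut hasher = DefaultHasher::new();
    hash_tree(dir, dir, &mut hasher)?;
    Ok(format!("{:016x}", hasher.finish()))
}
```

Because only paths and contents are hashed, `touch` on a file leaves the fingerprint unchanged, while any edit to file content changes it.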
## TIER 2 — PER-CRATE GRANULARITY (medium effort, very high impact)
### T2.1 — Split `base` workspace into per-binary sub-recipes
**Problem**: `base` is one recipe with 45 workspace members. A 1-line change rebuilds all 45.
**Fix**: Two options:
**Option A (simpler)**: Keep `base` as a Cargo workspace, but change the cookbook's `build()`
to track per-binary `stage.pkgar` files. Each binary gets its own pkgar; downstream depends on
specific binaries.
**Option B (cleaner)**: Split `base` into one recipe per binary. Each recipe:
- Has its own `recipe.toml` with `template = "cargo"` and a single `-p` filter
- Stages its own binary
- Other packages depend on specific binaries (e.g., `pcid-bin`, `usbhidd-bin`)
**Recommended**: Option A first (smaller diff), then migrate to Option B as cleanup.
**Implementation** (cookbook, `cook_build.rs`):
```rust
// Detect workspace members
let members = parse_workspace_members(&source_manifest)?;
for member in members {
    let member_pkgar = stage_dir.join(format!("{member}.pkgar"));
    // Per-member mtime + per-member hash
    let member_source_dir = source_dir.join(&member);
    let member_modified = modified_dir_ignore_git(&member_source_dir)?;
    let member_deps = member_dependencies(&member, &source_manifest)?;
    // A missing pkgar counts as stale
    let member_pkgar_modified = modified(&member_pkgar).ok();
    // Per-member cache check
    let stale = member_pkgar_modified.map_or(true, |built| {
        built < member_modified || built < deps_modified_for(&member_deps)
    });
    if stale {
        // Rebuild this member only (remaining cargo flags elided).
        // Note: `-p` expects the package name, which may differ from the member path.
        Command::new("cargo").args(["build", "-p", member.as_str()]).status()?;
    }
}
```
**Impact**: A change to `audiod` rebuilds only `audiod`, not all 45 base binaries. A change to
`pcid` rebuilds only `pcid`. This is the single biggest win.
**Effort**: 1-2 weeks (cookbook refactor + recipe updates).
### T2.2 — Restructure kernel and mesa recipes similarly
Same pattern as T2.1 for:
- `kernel` recipe (single `kernel.elf` output, but multiple internal stages)
- `mesa` recipe (single `libGL.so` etc., but multiple internal sub-libraries)
- `llvm21` recipe (single `clang` binary, but many internal components)
**Impact**: A change to one mesa component rebuilds only that component, not the whole mesa
build (which takes 20+ minutes).
**Effort**: 1 week each.
### T2.3 — Per-binfmt_pkg output tracking
**Problem**: A recipe's `stage/` directory contains many files, all bundled into one
`stage.pkgar`. The cookbook doesn't know which file in `stage/` corresponds to which binary.
**Fix**: Add `installs = [...]` to recipes (already partially supported), and use it to track
per-output mtime.
**Implementation**: When a recipe declares `installs = ["/usr/bin/foo", "/usr/lib/libbar.so"]`,
the cookbook:
- Tracks mtime per output path
- Computes per-output hash
- Lets downstream depend on specific output paths
**Impact**: When `libdrm.so` changes but `libdrm_intel.so` doesn't, only consumers of
`libdrm.so` rebuild.
**Effort**: 1-2 weeks.
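
The per-output comparison described in T2.3 reduces to diffing two manifests. The sketch below assumes a manifest is a map from install path to content hash; `changed_outputs` and `consumer_is_dirty` are hypothetical names, not existing cookbook functions.

```rust
use std::collections::BTreeMap;

/// Compare two per-output manifests (install path -> content hash) and
/// return the paths that were added, removed, or changed, sorted.
pub fn changed_outputs(
    old: &BTreeMap<String, String>,
    new: &BTreeMap<String, String>,
) -> Vec<String> {
    let mut changed = Vec::new();
    for (path, hash) in new {
        // Added or modified outputs
        if old.get(path) != Some(hash) {
            changed.push(path.clone());
        }
    }
    for path in old.keys() {
        // Removed outputs
        if !new.contains_key(path) {
            changed.push(path.clone());
        }
    }
    changed.sort();
    changed
}

/// A dependent needs a rebuild only if it consumes at least one changed path.
pub fn consumer_is_dirty(consumes: &[&str], changed: &[String]) -> bool {
    consumes.iter().any(|c| changed.iter().any(|p| p.as_str() == *c))
}
```

With the `libdrm` example above, a manifest in which only `/usr/lib/libdrm.so` has a new hash marks consumers of that path dirty and leaves consumers of `/usr/lib/libdrm_intel.so` clean.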
## TIER 3 — OUTPUT FINGERPRINTING (medium-high effort, very high impact)
### T3.1 — Hash the sysroot content of each recipe
**Problem**: Currently the cookbook only checks mtime of `stage.pkgar`, not its content. Two
builds that produce identical pkgar content still cascade downstream.
**Fix**: Compute BLAKE3 hash of the staged sysroot artifacts; cache it; use it as part of
the package fingerprint.
**Implementation** (`src/cook/cook_build.rs`):
```rust
// After staging files into stage/
let stage_fingerprint = compute_stage_fingerprint(&stage_dir)?;
let fp_file = stage_dir.join("stage.fingerprint");
let new_fp = blake3::hash(stage_fingerprint.as_bytes()).to_hex().to_string();
if let Ok(old_fp) = std::fs::read_to_string(&fp_file) {
    if old_fp.trim() == new_fp {
        // Stage contents identical: preserve mtime
        preserve_mtime_recursive(&stage_dir)?;
    }
}
std::fs::write(fp_file, new_fp)?;
```
**Impact**: A rebuild that produces identical output (e.g., due to deterministic compiler
output for unchanged sources) doesn't cascade.
**Effort**: 3-5 days.
### T3.2 — Cascade invalidation only when downstream input hash changes
**Problem**: Cascade currently triggers on mtime. Mtime can change without content change.
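**Fix**: Instead of comparing mtimes (`stage_modified < deps_modified`), compare hashes: rebuild
a downstream recipe only when the fingerprint of its declared inputs differs from the one
recorded at its last build (`stored_input_fingerprint != current_input_fingerprint`). Hashes have
no meaningful ordering, so the test is inequality, not `<`.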
Where:
- Each recipe declares a `fingerprint_inputs = [...]` list (paths it consumes)
- On each rebuild, hash the contents of those paths
- Store the hash as part of the recipe's fingerprint
**Implementation**:
```rust
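// Recipe declares (in recipe.toml):
//   [package]
//   fingerprint_inputs = ["/usr/lib/libdrm.so", "/usr/include/libdrm/drm.h"]

// Cookbook computes (`hash_paths` is a helper that still has to be written;
// the blake3 crate has no directory-hashing function):
let input_fingerprint = hash_paths(&fingerprint_inputs)?;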
```
**Impact**: When `libdrm.so` content doesn't change (e.g., internal implementation), consumers
don't rebuild.
**Effort**: 1-2 weeks.
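
A possible shape for the input fingerprint and the rebuild decision is sketched below, using only std. `input_fingerprint` and `needs_rebuild` are hypothetical names, the inputs are assumed to be paths relative to a sysroot directory, and `DefaultHasher` stands in for BLAKE3 (it is not cryptographic and not stable across Rust releases).

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::Hasher;
use std::path::Path;

/// Hash the declared inputs (paths inside `sysroot`) in sorted order.
/// A missing file contributes a marker byte, so deleting an input also
/// changes the fingerprint.
pub fn input_fingerprint(sysroot: &Path, inputs: &[&str]) -> String {
    let mut sorted: Vec<&str> = inputs.to_vec();
    sorted.sort_unstable();
    let mut hasher = DefaultHasher::new();
    for rel in sorted {
        hasher.write(rel.as_bytes());
        match fs::read(sysroot.join(rel.trim_start_matches('/'))) {
            Ok(data) => {
                hasher.write_u8(1);
                hasher.write_u64(data.len() as u64);
                hasher.write(&data);
            }
            Err(_) => hasher.write_u8(0),
        }
    }
    format!("{:016x}", hasher.finish())
}

/// Rebuild when no fingerprint was stored yet or when it differs from the
/// current one. Hashes are compared for equality only.
pub fn needs_rebuild(stored: Option<&str>, current: &str) -> bool {
    stored != Some(current)
}
```

Files in the sysroot that a recipe does not declare have no effect on its fingerprint, which is what limits the cascade to actual consumers.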
### T3.3 — Yocto-style equivalence cache for ABI-stable rebuilds
**Problem**: When a recipe's source changes but its output is byte-identical, the recipe
rebuilds but downstream should not.
**Fix**: Implement an "equivalence cache" — a database mapping old content hash → new content
hash for ABI-equivalent outputs. When the new content hash matches an old one (within the
equivalence class), downstream is not invalidated.
**Implementation**: SQLite-backed equivalence cache at `.redbear/equivalence.db`. Keyed by
input hash + build flags; value is the set of "equivalent" output hashes.
**Impact**: Even non-deterministic builds (e.g., embedded timestamps) can be marked equivalent.
**Effort**: 2-3 weeks.
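
The lookup logic of such a cache can be illustrated in memory before committing to a storage backend. The sketch below follows the description above (key = input hash + build flags, value = set of equivalent output hashes); `EquivalenceCache` and its methods are hypothetical, a `HashMap` stands in for the SQLite database, and how two outputs are judged equivalent in the first place is left open.

```rust
use std::collections::{BTreeSet, HashMap};

/// In-memory stand-in for the equivalence database.
#[derive(Default)]
pub struct EquivalenceCache {
    // key (input hash + build flags) -> output hashes known to be equivalent
    classes: HashMap<String, BTreeSet<String>>,
}

impl EquivalenceCache {
    /// Record that `output_hash` belongs to the equivalence class of `key`.
    pub fn record(&mut self, key: &str, output_hash: &str) {
        self.classes
            .entry(key.to_string())
            .or_default()
            .insert(output_hash.to_string());
    }

    /// Two outputs are equivalent if they are identical, or if both were
    /// recorded in the class stored under `key`.
    pub fn equivalent(&self, key: &str, a: &str, b: &str) -> bool {
        a == b
            || self
                .classes
                .get(key)
                .map_or(false, |class| class.contains(a) && class.contains(b))
    }
}
```

When `equivalent` returns true for the old and new output hash of a recipe, the cookbook would skip invalidating its dependents.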
## TIER 4 — PUBLIC API TRACKING (high effort, high impact for kernel/relibc)
### T4.1 — Distinguish public headers from internal sources
**Problem**: relibc changes cascade to all C/C++ packages. But only changes to relibc's public
headers (in `local/sources/relibc/include/`) should cascade. Internal changes to
`local/sources/relibc/src/` should not.
**Fix**: Each recipe declares `public_api = [...]` — the list of paths that constitute its
public API. Only mtime/hash changes to those paths trigger cascade.
**Implementation**:
```rust
let public_api = recipe.public_api_paths();
let public_api_modified = public_api.iter()
    .map(|p| modified(p))
    .max()?;
if stage_modified < public_api_modified {
    cascade_to_dependents();
}
```
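**Impact**: A change to `relibc/src/header/errno/mod.rs` (internal) doesn't cascade, provided it
leaves the installed headers unchanged. Upstream relibc generates headers from `src/header/` with
cbindgen, so the check should cover the staged headers as well as `include/`.
A change to `relibc/include/sys/errno.h` (public) does.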
**Effort**: 1-2 weeks per "API surface" (relibc, kernel, mesa).
### T4.2 — Track ABI via `.so` version files / SONAME bumps
**Problem**: Even if headers change, if the ABI version is unchanged, downstream can use the
new library without recompilation.
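**Fix**: Parse `.so` files for SONAME and compare it between the old and new build. If the
SONAME is unchanged, do not cascade to dynamically linked consumers. Statically linked consumers
embed the library code and still need a relink.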
**Implementation**: ELF SONAME parser in cookbook. SONAME = `readelf -d *.so | grep SONAME`.
**Impact**: Relibc ABI-preserving changes don't cascade to C/C++ packages.
**Effort**: 1 week.
### T4.3 — Header dependency graph for C/C++ packages
**Problem**: A C/C++ package's includes cascade is "any header in any include path", which is
over-inclusive. The actual cascade should be "headers this file actually includes".
**Fix**: Use `gcc -M` / `clang -M` to generate per-file header dependencies. Hash the
resulting `.d` file. Cascade only when those specific headers change.
**Impact**: A change to `errno.h` doesn't cascade to packages that don't include `errno.h`.
**Effort**: 1-2 weeks.
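
The `.d` files written by `gcc -M` / `clang -M` are Makefile rules, so extracting the header list is a small parsing job. The following is a sketch under simplifying assumptions: `headers_from_depfile` is a hypothetical name, only prerequisites ending in `.h` are kept, and paths with escaped spaces or a drive-letter colon are not handled.

```rust
/// Parse a Makefile-style dependency file and return the header
/// prerequisites, sorted and de-duplicated.
pub fn headers_from_depfile(depfile: &str) -> Vec<String> {
    // Join continuation lines (backslash followed by a newline)
    let joined = depfile.replace("\\\r\n", " ").replace("\\\n", " ");
    let mut headers = Vec::new();
    for rule in joined.lines() {
        // Everything after the first ':' is the prerequisite list
        let Some((_, prereqs)) = rule.split_once(':') else {
            continue;
        };
        for dep in prereqs.split_whitespace() {
            if dep.ends_with(".h") {
                headers.push(dep.to_string());
            }
        }
    }
    headers.sort();
    headers.dedup();
    headers
}
```

Hashing the contents of exactly these headers, rather than every file on the include path, gives the per-package cascade condition described above.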
## TIER 5 — RESTAT / OUTPUT STABILITY (medium effort, medium impact)
### T5.1 — After rebuild, check if installed files differ from previous
**Problem**: Currently, every rebuild produces a new `stage.pkgar` regardless of whether
content changed.
**Fix**: After `cargo build` and before `pkgar` packaging, diff the new sysroot against the
old sysroot. If all files are byte-identical, copy old `stage.pkgar` mtime to new files.
**Implementation**: `diff -r` or content-hash comparison.
**Impact**: Idempotent builds don't cascade.
**Effort**: 3-5 days.
### T5.2 — Idempotent packaging
**Problem**: pkgar files include timestamps, so identical content produces different pkgar
files.
**Fix**: Make pkgar packaging deterministic (sort entries, zero timestamps, fixed compression).
**Impact**: Identical content → identical pkgar → no cascade.
**Effort**: 1 week (upstream pkgar changes).
## TIER 6 — DEPENDENCY GRAPH ANALYSIS (low effort, medium impact)
### T6.1 — Add `repo graph` to show full dependency graph
**Problem**: Hard to know what rebuilds when X changes.
**Fix**: Add `repo graph` that emits the full dep graph in DOT format. Visualize with
`xdot` or similar.
**Implementation**: New `src/bin/repo_graph.rs` (or subcommand in repo.rs).
**Effort**: 2-3 days.
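
Emitting DOT needs no library support. A minimal sketch, assuming the dependency map (recipe name to the recipes it depends on) is already loaded; `to_dot` is a hypothetical name:

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

/// Render a dependency map as a Graphviz digraph with edges pointing from a
/// recipe to each of its dependencies.
pub fn to_dot(deps: &BTreeMap<String, Vec<String>>) -> String {
    let mut out = String::from("digraph recipes {\n");
    for (pkg, ds) in deps {
        // Emit the node itself so recipes without dependencies still appear
        let _ = writeln!(out, "    \"{}\";", pkg);
        for d in ds {
            let _ = writeln!(out, "    \"{}\" -> \"{}\";", pkg, d);
        }
    }
    out.push_str("}\n");
    out
}
```

The output can be piped into `xdot` or `dot -Tsvg`. Using a `BTreeMap` keeps the node order stable, so the output is diff-friendly.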
### T6.2 — Add `repo cook --since=<commit>` to only rebuild affected packages
**Problem**: When you `git pull` or merge a branch, you want to rebuild only what the merge
touched.
**Fix**: Use `git diff --name-only` between old HEAD and new HEAD, walk reverse deps.
**Implementation**: New `--since` flag in `repo cook`. Falls back to `--changed` for tracked
files.
**Effort**: 3-5 days.
### T6.3 — Add `repo why <pkg>` to show what triggers rebuilds
**Problem**: When `pkg` rebuilds, why? What cascaded into it?
**Fix**: Reverse-dep analysis — show the path from each changed source to the target recipe.
**Implementation**: BFS from changed source paths, through recipe deps and Cargo workspace
members, to target recipe.
**Effort**: 2-3 days.
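
The BFS with path reconstruction could be sketched as follows, at recipe level only. This is an illustration under assumptions: `why` as a function name and the dependency map shape are hypothetical, and Cargo workspace members are not modelled.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Return one shortest chain `changed -> ... -> target` along reverse
/// dependency edges, or None if a change to `changed` cannot reach `target`.
/// `deps` maps recipe -> recipes it depends on.
pub fn why(
    changed: &str,
    target: &str,
    deps: &BTreeMap<String, Vec<String>>,
) -> Option<Vec<String>> {
    // parent[x] = the recipe whose rebuild caused x to be visited
    let mut parent: BTreeMap<String, Option<String>> = BTreeMap::new();
    parent.insert(changed.to_string(), None);
    let mut queue = VecDeque::from([changed.to_string()]);
    while let Some(cur) = queue.pop_front() {
        if cur == target {
            // Walk the parent links back to the changed recipe
            let mut path = vec![cur.clone()];
            let mut node = cur;
            while let Some(Some(p)) = parent.get(&node) {
                path.push(p.clone());
                node = p.clone();
            }
            path.reverse();
            return Some(path);
        }
        // Dependents of `cur` are the recipes that list it as a dependency
        for (pkg, ds) in deps {
            if ds.iter().any(|d| d == &cur) && !parent.contains_key(pkg) {
                parent.insert(pkg.clone(), Some(cur.clone()));
                queue.push_back(pkg.clone());
            }
        }
    }
    None
}
```

Scanning all recipes for each visited node is quadratic in the worst case; precomputing the reverse edges once would bring it in line with the performance constraint stated later in this plan.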
## PRIORITIZED ROADMAP
| Tier | Effort | Impact | Risk | Priority |
|---|---|---|---|---|
| T1.1 Content-hash stage.pkgar | 1 day | High (catches all no-op rebuilds) | Low | **P0 — DO FIRST** |
| T1.4 Per-source-hash via git tree | 2 days | High (eliminates spurious dirty) | Low | **P0** |
| T1.2 `--since` flag | 3-5 days | High (developer workflow) | Medium | **P1** |
| T1.3 Cascade script cargo-aware | 2-3 days | Medium | Low | **P1** |
| T2.1 Split base per-binary | 1-2 weeks | Very high (45 → 1 rebuild) | Medium (breaking) | **P1** |
| T3.1 Sysroot fingerprint | 3-5 days | High | Low | **P1** |
| T2.2 Split kernel/mesa | 1 week each | High | Medium | **P2** |
| T3.2 Downstream-input hash | 1-2 weeks | Very high | Medium | **P2** |
| T6.1 `repo graph` | 2-3 days | Medium (devx) | Low | **P2** |
| T6.2 `--since` commit | 3-5 days | High (devx) | Low | **P2** |
| T5.1 Restat diff | 3-5 days | Medium | Low | **P3** |
| T3.3 Equivalence cache | 2-3 weeks | High | Medium (cache coherency) | **P3** |
| T4.1 Public API surface | 1-2 weeks | High (relibc) | Medium (semantics) | **P3** |
| T4.2 SONAME tracking | 1 week | Medium | Low | **P4** |
| T4.3 Header dep graph | 1-2 weeks | Medium | Medium | **P4** |
| T5.2 Idempotent pkgar | 1 week | Medium | Medium | **P4** |
| T2.3 Per-binfmt_pkg | 1-2 weeks | High | Medium | **P4** |
| T6.3 `repo why` | 2-3 days | Low (devx) | Low | **P4** |
## PHASED IMPLEMENTATION
### Phase A (1-2 weeks) — Stop the bleeding
- T1.1 content-hash stage.pkgar
- T1.4 git-tree source fingerprint
- T1.3 cargo-aware cascade
- Result: 2-line Cargo.toml change no longer cascades if output is identical
### Phase B (2-4 weeks) — Per-crate granularity
- T2.1 split base workspace (1-2 weeks)
- T2.2 split kernel/mesa (1 week each)
- T3.1 sysroot fingerprint
- T3.2 downstream-input hash cascade
- Result: A change to one driver rebuilds only that driver
### Phase C (2-4 weeks) — API surface tracking
- T4.1 public API surface for relibc + kernel
- T4.2 SONAME tracking
- T4.3 header dep graph
- T5.1 restat diff
- T5.2 idempotent pkgar
- Result: Internal implementation changes don't cascade
### Phase D (1-2 weeks) — Developer experience
- T6.1 `repo graph`
- T6.2 `--since` commit
- T6.3 `repo why`
- T1.2 `--since=<ref>` incremental
- Result: Developer can answer "what rebuilds" instantly
## METRICS & SUCCESS CRITERIA
The build system is healthy when:
| Metric | Current | Target |
|---|---|---|
| 1-line Cargo.toml change rebuilds | Full OS (1.5h) | < 5 min (only changed recipe) |
| `make` after no source change | Full OS (1.5h) | 0 sec (idempotent, no-op) |
| 1-line kernel source change | Full OS (1.5h) | < 10 min (kernel + kernel consumers) |
| 1-line relibc internal change | Full OS (1.5h) | < 5 min (relibc + 0 consumers if API unchanged) |
| `repo cook --since=v0.1.0` | Full OS | < 1 min (1-2 packages) |
| `repo why mesa` | N/A | < 1 sec (printed graph) |
## DESIGN CONSTRAINTS
These constraints are non-negotiable:
1. **Offline-first**: `REPO_OFFLINE=1` must remain default. All changes must work without
network access.
2. **Determinism**: Outputs must be byte-identical for identical inputs (modulo timestamps).
3. **Backward compat**: Existing recipes must continue to work without modification.
4. **No new build dependencies**: Only use crates already in the workspace.
5. **Performance**: Fingerprint computation must be O(source) or O(staged-output), not O(n²).
6. **Durability**: Fingerprint caches must survive `make distclean` (in `local/.cache/`).
## NON-GOALS
We will NOT:
- Replace Cargo (too invasive, too risky)
- Migrate to Bazel or Nix (would require months of work)
- Add remote artifact caching (out of scope; we have local sstate)
- Rewrite the build in a different language
- Add a distributed build cluster
## WHY THIS PLAN WILL WORK
Mature systems (Nix, Cargo, Yocto) already implement these patterns. The techniques are
proven. Red Bear OS only needs to add **output fingerprinting** and **per-crate granularity**,
both of which are well-understood in the broader build systems literature.
The hardest part is **T2.1 (per-crate granularity for base)** because it requires cookbook
changes. But the rest can be implemented incrementally and tested with `--no-cache` for
correctness.
## NEXT STEPS
1. Implement T1.1 (content-hash stage.pkgar) — 1 day, low risk
2. Implement T1.4 (git-tree source fingerprint) — 2 days, low risk
3. Implement T1.3 (cargo-aware cascade) — 2-3 days, low risk
4. Test: 1-line Cargo.toml change should rebuild only base
5. Implement T2.1 (per-binary base) — 1-2 weeks
6. Test: 1-line pcid source change should rebuild only pcid
7. Implement T3.1, T3.2 (output fingerprinting + cascade-by-hash)
8. Test: rebuild with identical source produces no cascade
9. Phase D devx improvements (graph, why, since)
## REFERENCES
- Nix Quotient Hashing: <https://nix.dev/manual/nix/2.34/store/derivation/outputs/content-address>
- Cargo Fingerprint Module: <https://doc.rust-lang.org/stable/nightly-rustc/cargo/core/compiler/fingerprint/index.html>
- GN Analyze: <https://gn.googlesource.com/gn/+/HEAD/docs/reference.md>
- Yocto sstate: <https://docs.yoctoproject.org/5.0.9/overview-manual/concepts.html>
- Bazel Skyframe: <https://preview.bazel.build/reference/skyframe>
- Buildroot Rebuilding: <https://buildroot.org/downloads/manual/rebuilding-packages.txt>