iftypeset/docs/07-session-resilience.md
2026-01-08 18:10:41 +00:00

Session Resilience (avoid “lost work” when chats reset)

Codex/chat sessions can lose conversation context when a connection drops. The filesystem does not: the durable source of truth is the repo.

This project adds a few boring mechanisms to make resuming work deterministic.

What to trust

  • Repo state: README.md, STATUS.md, docs/ are canonical.
  • CI: ./scripts/ci.sh is the fastest sanity check.
  • Artifacts: out/ contains the latest reports from CI runs.

Quick resume checklist (30 seconds)

From the repo root:

  • ./scripts/resume.sh (recommended)
  • ./scripts/audit.sh (raw)
  • ./scripts/state.sh (writes docs/SESSION_STATE.md for pasteable state)
  • ./scripts/ci.sh

If the audit output and the CI run both look sane, you're back.
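
The checklist can be wrapped in a small helper that runs each script in order and stops at the first failure. This is a minimal sketch, not something the repo ships: `run_resume_checks` is a hypothetical name, and it assumes each script exits non-zero when something is wrong.

```shell
# Sketch: run the resume checklist in order, stopping at the first failure.
# Script names come from the checklist; run_resume_checks is hypothetical.
# Assumption: every script signals failure with a non-zero exit status.
run_resume_checks() {
  local repo_root="${1:-.}"
  local step
  for step in resume.sh audit.sh state.sh ci.sh; do
    if [ ! -x "$repo_root/scripts/$step" ]; then
      echo "missing or not executable: scripts/$step" >&2
      return 2
    fi
    echo "==> scripts/$step"
    # Run from the repo root, as the checklist requires.
    ( cd "$repo_root" && "./scripts/$step" ) || {
      echo "FAILED: scripts/$step" >&2
      return 1
    }
  done
  echo "all resume checks passed"
}
```

Calling `run_resume_checks .` from the repo root returns 0 only if all four scripts succeed, which makes "looks sane" a checkable condition rather than a judgment call.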

Multi-session coordination (avoid Codex waiting)

When running multiple Codex sessions in parallel, coordinate through the repo:

  • Task queue: docs/13-task-board.md
  • Session context: docs/SESSION_STATE.md
  • Durable restore points: docs/CHECKPOINTS.md + out/checkpoints/

Workflow:

  1. Pick a task from docs/13-task-board.md and claim it (owner + working_set).
  2. Stay inside your working_set to avoid conflicts.
  3. Run ./scripts/ci.sh.
  4. Create a restore point: ./scripts/checkpoint.sh "task <id>: <note>"
  5. Mark the task done and paste the checkpoint reference.

If a change must block others (shared files), create a temporary lock file:

  • docs/LOCK_<area>.md
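
A lock file only helps if two sessions cannot both create it. One way to get that on a shared filesystem is the shell's `noclobber` option, sketched below. The helper names and the `owner:` / `created:` fields are assumptions for illustration; the document only specifies the `docs/LOCK_<area>.md` path.

```shell
# Sketch: create docs/LOCK_<area>.md only if it does not already exist.
# acquire_lock / release_lock and the file contents are hypothetical.
acquire_lock() {
  local area="$1" owner="$2" docs_dir="${3:-docs}"
  local lock="$docs_dir/LOCK_${area}.md"
  [ -d "$docs_dir" ] || { echo "no such directory: $docs_dir" >&2; return 2; }
  # With noclobber set, the redirect fails if the file already exists,
  # so a second session cannot silently overwrite the first one's lock.
  if ( set -o noclobber
       printf 'owner: %s\ncreated: %s\n' "$owner" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$lock"
     ) 2>/dev/null; then
    echo "locked: $lock"
  else
    echo "already locked: $lock" >&2
    return 1
  fi
}

release_lock() {
  local area="$1" docs_dir="${2:-docs}"
  rm -f "$docs_dir/LOCK_${area}.md"
}
```

This guards sessions that share one working tree. Sessions on separate machines would still need to commit and push the lock file before relying on it.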

Coordination contract (short)

Use this when handing off across sessions:

  1. Claim the task in docs/13-task-board.md.
  2. Declare the working set in docs/SESSION_STATE.md.
  3. Lock shared areas via docs/LOCK_<area>.md if needed.
  4. Checkpoint after changes (./scripts/checkpoint.sh "note").
  5. Close the task and paste the checkpoint reference.

“It looks rolled back” (most common disconnect trap)

If you resume a session and it looks like files “disappeared”, it's almost always one of:

  • You're looking at a clean git checkout somewhere else (remote machine / new container) and the prior work was never committed.
  • You're looking at a stale copy outside the repo (see warning below).

What to do:

  1. Run ./scripts/audit.sh and inspect the git status --porcelain output.

  2. Check out/checkpoints/ for the latest iftypeset_checkpoint_*.tar.gz.

  3. If you need to move the work to a new machine/session, copy the latest checkpoint tarball and extract it:

    • tar -xzf out/checkpoints/iftypeset_checkpoint_<timestamp>.tar.gz -C <new_dir>
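
Steps 2 and 3 can be scripted. The sketch below picks the newest tarball by filename and extracts it into a fresh directory. It assumes the `<timestamp>` part of the filename sorts lexicographically in chronological order; the function names are hypothetical.

```shell
# Sketch: locate the newest checkpoint tarball and restore it elsewhere.
# Assumption: the <timestamp> in the filename sorts chronologically.
latest_checkpoint() {
  local dir="${1:-out/checkpoints}" f latest=""
  for f in "$dir"/iftypeset_checkpoint_*.tar.gz; do
    [ -e "$f" ] || continue   # the glob matched nothing
    latest="$f"               # globs expand in sorted order; keep the last
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

restore_checkpoint() {
  local tarball="$1" dest="$2"
  [ -f "$tarball" ] || { echo "no such checkpoint: $tarball" >&2; return 1; }
  mkdir -p "$dest"
  tar -xzf "$tarball" -C "$dest"
}
```

For example, `restore_checkpoint "$(latest_checkpoint)" /tmp/iftypeset_restore` unpacks the most recent snapshot without touching the current working tree.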

Important: checkpoints snapshot the repo tree (including untracked files). They are the durable “chat-proof” handoff mechanism even when you don't want to commit yet.

Create a checkpoint (2 minutes)

When you finish a meaningful chunk of work (new rule batches, QA changes, renderer changes), run:

  • ./scripts/checkpoint.sh "what changed"

This:

  • runs CI and stores the CI JSON in out/checkpoints/
  • creates a compressed snapshot tarball in out/checkpoints/
  • appends a new entry to docs/CHECKPOINTS.md with the snapshot hash
  • writes docs/SESSION_STATE.md so the snapshot includes a pasteable “resume context”

This gives you a portable restore point even if the chat transcript is gone.
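
Because each entry in docs/CHECKPOINTS.md records a snapshot hash, a copied tarball can be checked against the log before you trust it. The sketch below assumes the recorded hash is a SHA-256 hex digest that appears verbatim in the log; the document does not state the algorithm or entry format, so adjust to whatever `checkpoint.sh` actually writes. Function names are hypothetical.

```shell
# Sketch: confirm a tarball's digest is recorded in the checkpoint log.
# Assumption: the log stores a SHA-256 hex digest verbatim.
snapshot_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d ' ' -f 1
  else
    shasum -a 256 "$1" | cut -d ' ' -f 1   # fallback where sha256sum is absent
  fi
}

verify_checkpoint() {
  local tarball="$1" log="${2:-docs/CHECKPOINTS.md}" digest
  digest="$(snapshot_sha256 "$tarball" 2>/dev/null)"
  [ -n "$digest" ] || return 2   # tarball unreadable or no hashing tool
  grep -qF "$digest" "$log"      # 0 if recorded, 1 if not
}
```

A return status of 1 means the tarball does not match any logged snapshot, for example because it was truncated during the copy.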

Beyond checkpoints, a few habits reduce the risk further:

  • Push to a remote early (Forgejo/GitHub). A remote is the best anti-loss mechanism.
  • Treat STATUS.md as the “1-page truth” for what exists and what's next.
  • Don't rely on chat logs for state; copy any critical decisions into docs/.

Common “it looks rolled back” trap

If you have multiple status documents on disk, prefer the one inside the repo:

  • canonical: ai-workspace/iftypeset/docs/09-project-status.md
  • ⚠️ non-canonical copies (e.g., /root/docs/09-project-status.md) can drift and misreport counts.
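
A quick byte-for-byte comparison shows whether a stray copy has drifted from the canonical file. This is a minimal sketch using `cmp -s`; `check_status_copy` is a hypothetical helper name.

```shell
# Sketch: warn when a copy outside the repo differs from the canonical file.
# check_status_copy is hypothetical; pass the two paths explicitly.
check_status_copy() {
  local canonical="$1" copy="$2"
  [ -f "$canonical" ] || { echo "canonical file missing: $canonical" >&2; return 2; }
  [ -f "$copy" ] || { echo "no stray copy at $copy"; return 0; }
  if cmp -s "$canonical" "$copy"; then
    echo "copy matches canonical: $copy"
  else
    echo "WARNING: $copy differs from $canonical; trust the repo copy" >&2
    return 1
  fi
}
```

With the paths from the list above, the call would be `check_status_copy ai-workspace/iftypeset/docs/09-project-status.md /root/docs/09-project-status.md`.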