diff --git a/documentation/BOOTSTRAP__AI_Agent_Quickstart.md b/documentation/BOOTSTRAP__AI_Agent_Quickstart.md index cb150c71..680b1169 100644 --- a/documentation/BOOTSTRAP__AI_Agent_Quickstart.md +++ b/documentation/BOOTSTRAP__AI_Agent_Quickstart.md @@ -300,14 +300,14 @@ These are real incidents — know them before you start. 6. **Deleting files with `rm`** — always move to `~/tmp/agents_trash`. A deleted file may contain context that's not recoverable from git if it was gitignored. -8. **Dexie `.get()` with a string object ID returns `undefined`** — Dexie `.get(value)` +7. **Dexie `.get()` with a string object ID returns `undefined`** — Dexie `.get(value)` looks up by the table's **primary key**, which is `id` (the first schema field). The V3 API never returns `id`, so it is always `undefined` in stored records. Passing a string object ID (e.g. `person_id`) to `.get()` will silently return nothing. Always use `.where('person_id').equals(person_id).first()` instead. This has caused liveQuery blocks to always produce `undefined` even when the record exists in Dexie. -9. **Treating `$effect` blocks as auth bypass risks** — a `$effect` inside a child +8. **Treating `$effect` blocks as auth bypass risks** — a `$effect` inside a child component cannot bypass a parent `+layout.svelte` auth gate. Children only mount if the parent calls `{@render children?.()}`. Adding redundant auth guards to `$effect` blocks that can only run after the parent gate already passed is unnecessary — and @@ -317,13 +317,13 @@ These are real incidents — know them before you start. clean of data loads in private modules. See `GUIDE__SvelteKit2_Svelte5_DexieJS.md` → "SvelteKit Layout Hierarchy: Security and Execution Order" for the full explanation. -10. **Using query `key` as a proxy for bypass stripped `x-account-id`** — this caused +9. **Using query `key` as a proxy for bypass stripped `x-account-id`** — this caused valid account-scoped requests to lose account context and 403. `key` can be a valid endpoint/business param, but it is not equivalent to `x-no-account-id: bypass`. Keep `x-no-account-id` usage narrow and temporary; do not expand it without a documented allowlist case. -11. **Pre-stringifying `*_json` fields before passing to API wrappers** — the API wrappers +10. **Pre-stringifying `*_json` fields before passing to API wrappers** — the API wrappers (`api_post__crud_obj.ts` for V3, `api.ts` for legacy CRUD) automatically serialize any field ending in `_json` (e.g. `cfg_json`, `data_json`). Pass these as plain JS objects. Pre-stringifying with `JSON.stringify()` before calling the wrapper will double-encode @@ -331,12 +331,12 @@ These are real incidents — know them before you start. redundant on the V3 path. Both paths now pretty-print with 2-space indent. See `GUIDE__AE_API_V3_for_Frontend.md` → section 3C for the full explanation. -12. **Broad Dexie result windows get silently clipped** — if a broad "All" view shows fewer +11. **Broad Dexie result windows get silently clipped** — if a broad "All" view shows fewer rows than a narrower filter, check for a page-level limit or an API revalidation step replacing the local IDB result set. For empty text searches, the full local result set should drive the display; server refreshes should update cache, not shrink visibility. -13. **Not bumping `IDB_CONTENT_VERSIONS` when changing `properties_to_save`** — this caused +12. **Not bumping `IDB_CONTENT_VERSIONS` when changing `properties_to_save`** — this caused the IDAA Recovery Meetings "no meetings found" bug for approximately one year (2025–2026). **What happened:** A deploy changed `properties_to_save` in `ae_events__event.ts`, but no @@ -368,6 +368,35 @@ These are real incidents — know them before you start. 0 results in your templates. Silent failures look like data problems and are extremely difficult to diagnose. +13. **Breaking the API retry loop by returning errors instead of throwing them** — all four + `api_*_object.ts` files (`api_get_object.ts`, `api_post_object.ts`, `api_patch_object.ts`, + `api_delete_object.ts`) use a `.catch()` that returns the error as a value, followed by a + classification block. That block **must throw** for transient network failures (`TypeError`) + so they enter the retry loop. If you change it to `return false`, retries are silently + bypassed for the most common failure mode in hotel/conference WiFi — and nothing warns you. + + **What happened (commit a10accfaa, Jan 2026):** A "silence background fetch noise" commit + changed `.catch()` to explicitly `return error`, then the classification block was changed + from a `throw` to `return false`. `TypeError` from `ERR_NETWORK_CHANGED` — the most common + failure on crowded WiFi — stopped retrying. The `retry_count = 5` parameter became dead + code for network errors. Went undetected for ~4 months. + + **The retry classification these files must honor:** + - `TypeError` (ERR_NETWORK_CHANGED, WiFi blip) → **`throw`** → enters retry loop with backoff + - `AbortError` where `did_timeout_abort = true` (helper's own timer) → **`throw`** → retries + - `AbortError` where `did_timeout_abort = false` (navigation/unmount abort) → `return false` + - HTTP 400/401/403/422 → `return false` immediately (client errors are deterministic) + - HTTP 5xx → **`throw`** → retries with backoff + + **How to verify after any change to the error block:** confirm that a `TypeError` still + produces up to 5 retry attempts with 2s→4s→6s→8s delays before returning false. A single + `return false` after the first network failure means the retry loop is broken. + + **Also:** when reviewing these files, check that all four have: + - `ae_auth_error.set()` triggered on 401/403 (shows session-expired banner to the user) + - `timeout = 20000` default (was 60s in PATCH/DELETE until 2026-05-21 — 5-min worst case) + - `did_timeout_abort` flag per attempt (separates helper timeouts from caller aborts) + --- ## 8. Source Layout (Quick Reference) diff --git a/documentation/TODO__Agents.md b/documentation/TODO__Agents.md index 41cccf31..0f476952 100644 --- a/documentation/TODO__Agents.md +++ b/documentation/TODO__Agents.md @@ -233,6 +233,13 @@ including timeout abort separation and capped retry backoff. - ✅ No regression for 400/401/403/422 fail-fast behavior. - ✅ `npx svelte-check` clean, API-focused Playwright tests remained green during rollout. +**Additional fixes found during review pass (2026-05-21, commit ea765d8ad):** +- PATCH + DELETE: default timeout lowered from 60s → 20s to match GET/POST. No callers set + explicit timeouts; 60s × 5 retries = 5-minute worst case before giving up. +- DELETE: added `ae_auth_error` import and session-expired banner on 401/403. All other + files (GET/POST/PATCH) trigger the banner; DELETE was missing it, causing stale-session + deletes to silently return false with no user-visible feedback. + --- ### [Testing] V3 API performance probe (basic stress rounds)