docs: update TODO and add BOOTSTRAP mistake #13 for API retry regression

TODO__Agents.md:
- Added the two additional fixes from the review pass to the PATCH/DELETE
  retry hardening entry: default timeout 60s→20s, and DELETE missing
  ae_auth_error banner on 401/403.

BOOTSTRAP__AI_Agent_Quickstart.md:
- Added mistake #13: breaking the API retry loop by returning errors from
  the TypeError/AbortError block instead of throwing them. Documents the
  Jan 2026 regression (commit a10accfaa), the three retry classes that must
  be preserved, and a quick verification method.
- Filled the gap at item #7 (was missing, causing off-by-one numbering
  from item 8 onward). Items renumbered 8-14 → 7-13.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Scott Idem
2026-05-21 18:21:01 -04:00
parent ea765d8ad2
commit b3029a4d27
2 changed files with 42 additions and 6 deletions

View File

@@ -300,14 +300,14 @@ These are real incidents — know them before you start.
6. **Deleting files with `rm`** — always move to `~/tmp/agents_trash`. A deleted file may
contain context that's not recoverable from git if it was gitignored.
8. **Dexie `.get()` with a string object ID returns `undefined`** — Dexie `.get(value)`
7. **Dexie `.get()` with a string object ID returns `undefined`** — Dexie `.get(value)`
looks up by the table's **primary key**, which is `id` (the first schema field). The V3
API never returns `id`, so it is always `undefined` in stored records. Passing a string
object ID (e.g. `person_id`) to `.get()` will silently return nothing. Always use
`.where('person_id').equals(person_id).first()` instead. This has caused liveQuery
blocks to always produce `undefined` even when the record exists in Dexie.
9. **Treating `$effect` blocks as auth bypass risks** — a `$effect` inside a child
8. **Treating `$effect` blocks as auth bypass risks** — a `$effect` inside a child
component cannot bypass a parent `+layout.svelte` auth gate. Children only mount if
the parent calls `{@render children?.()}`. Adding redundant auth guards to `$effect`
blocks that can only run after the parent gate already passed is unnecessary — and
@@ -317,13 +317,13 @@ These are real incidents — know them before you start.
clean of data loads in private modules. See `GUIDE__SvelteKit2_Svelte5_DexieJS.md`
"SvelteKit Layout Hierarchy: Security and Execution Order" for the full explanation.
10. **Using query `key` as a proxy for bypass stripped `x-account-id`** — this caused
9. **Using query `key` as a proxy for bypass stripped `x-account-id`** — this caused
valid account-scoped requests to lose account context and 403. `key` can be a valid
endpoint/business param, but it is not equivalent to `x-no-account-id: bypass`. Keep
`x-no-account-id` usage narrow and temporary; do not expand it without a documented
allowlist case.
11. **Pre-stringifying `*_json` fields before passing to API wrappers** — the API wrappers
10. **Pre-stringifying `*_json` fields before passing to API wrappers** — the API wrappers
(`api_post__crud_obj.ts` for V3, `api.ts` for legacy CRUD) automatically serialize any
field ending in `_json` (e.g. `cfg_json`, `data_json`). Pass these as plain JS objects.
Pre-stringifying with `JSON.stringify()` before calling the wrapper will double-encode
@@ -331,12 +331,12 @@ These are real incidents — know them before you start.
redundant on the V3 path. Both paths now pretty-print with 2-space indent.
See `GUIDE__AE_API_V3_for_Frontend.md` → section 3C for the full explanation.
12. **Broad Dexie result windows get silently clipped** — if a broad "All" view shows fewer
11. **Broad Dexie result windows get silently clipped** — if a broad "All" view shows fewer
rows than a narrower filter, check for a page-level limit or an API revalidation step
replacing the local IDB result set. For empty text searches, the full local result set
should drive the display; server refreshes should update cache, not shrink visibility.
13. **Not bumping `IDB_CONTENT_VERSIONS` when changing `properties_to_save`** — this caused
12. **Not bumping `IDB_CONTENT_VERSIONS` when changing `properties_to_save`** — this caused
the IDAA Recovery Meetings "no meetings found" bug for approximately one year (20252026).
**What happened:** A deploy changed `properties_to_save` in `ae_events__event.ts`, but no
@@ -368,6 +368,35 @@ These are real incidents — know them before you start.
0 results in your templates. Silent failures look like data problems and are extremely
difficult to diagnose.
13. **Breaking the API retry loop by returning errors instead of throwing them** — all four
`api_*_object.ts` files (`api_get_object.ts`, `api_post_object.ts`, `api_patch_object.ts`,
`api_delete_object.ts`) use a `.catch()` that returns the error as a value, followed by a
classification block. That block **must throw** for transient network failures (`TypeError`)
so they enter the retry loop. If you change it to `return false`, retries are silently
bypassed for the most common failure mode in hotel/conference WiFi — and nothing warns you.
**What happened (commit a10accfaa, Jan 2026):** A "silence background fetch noise" commit
changed `.catch()` to explicitly `return error`, then the classification block was changed
from a `throw` to `return false`. `TypeError` from `ERR_NETWORK_CHANGED` — the most common
failure on crowded WiFi — stopped retrying. The `retry_count = 5` parameter became dead
code for network errors. Went undetected for ~4 months.
**The retry classification these files must honor:**
- `TypeError` (ERR_NETWORK_CHANGED, WiFi blip) → **`throw`** → enters retry loop with backoff
- `AbortError` where `did_timeout_abort = true` (helper's own timer) → **`throw`** → retries
- `AbortError` where `did_timeout_abort = false` (navigation/unmount abort) → `return false`
- HTTP 400/401/403/422 → `return false` immediately (client errors are deterministic)
- HTTP 5xx → **`throw`** → retries with backoff
**How to verify after any change to the error block:** confirm that a `TypeError` still
produces up to 5 retry attempts with 2s→4s→6s→8s delays before returning false. A single
`return false` after the first network failure means the retry loop is broken.
**Also:** when reviewing these files, check that all four have:
- `ae_auth_error.set()` triggered on 401/403 (shows session-expired banner to the user)
- `timeout = 20000` default (was 60s in PATCH/DELETE until 2026-05-21 — 5-min worst case)
- `did_timeout_abort` flag per attempt (separates helper timeouts from caller aborts)
---
## 8. Source Layout (Quick Reference)

View File

@@ -233,6 +233,13 @@ including timeout abort separation and capped retry backoff.
- ✅ No regression for 400/401/403/422 fail-fast behavior.
-`npx svelte-check` clean, API-focused Playwright tests remained green during rollout.
**Additional fixes found during review pass (2026-05-21, commit ea765d8ad):**
- PATCH + DELETE: default timeout lowered from 60s → 20s to match GET/POST. No callers set
explicit timeouts; 60s × 5 retries = 5-minute worst case before giving up.
- DELETE: added `ae_auth_error` import and session-expired banner on 401/403. All other
files (GET/POST/PATCH) trigger the banner; DELETE was missing it, causing stale-session
deletes to silently return false with no user-visible feedback.
---
### [Testing] V3 API performance probe (basic stress rounds)