docs: update TODO and add BOOTSTRAP mistake #13 for API retry regression

TODO__Agents.md:
- Added the two additional fixes from the review pass to the PATCH/DELETE
  retry hardening entry: default timeout 60s→20s, and DELETE missing
  ae_auth_error banner on 401/403.

BOOTSTRAP__AI_Agent_Quickstart.md:
- Added mistake #13: breaking the API retry loop by returning errors from
  the TypeError/AbortError block instead of throwing them. Documents the
  Jan 2026 regression (commit a10accfaa), the three retry classes that must
  be preserved, and a quick verification method.
- Filled the gap at item #7 (was missing, causing off-by-one numbering
  from item 8 onward). Items renumbered 8-14 → 7-13.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
Scott Idem
2026-05-21 18:21:01 -04:00
parent ea765d8ad2
commit b3029a4d27
2 changed files with 42 additions and 6 deletions

View File

@@ -300,14 +300,14 @@ These are real incidents — know them before you start.
6. **Deleting files with `rm`** — always move to `~/tmp/agents_trash`. A deleted file may 6. **Deleting files with `rm`** — always move to `~/tmp/agents_trash`. A deleted file may
contain context that's not recoverable from git if it was gitignored. contain context that's not recoverable from git if it was gitignored.
8. **Dexie `.get()` with a string object ID returns `undefined`** — Dexie `.get(value)` 7. **Dexie `.get()` with a string object ID returns `undefined`** — Dexie `.get(value)`
looks up by the table's **primary key**, which is `id` (the first schema field). The V3 looks up by the table's **primary key**, which is `id` (the first schema field). The V3
API never returns `id`, so it is always `undefined` in stored records. Passing a string API never returns `id`, so it is always `undefined` in stored records. Passing a string
object ID (e.g. `person_id`) to `.get()` will silently return nothing. Always use object ID (e.g. `person_id`) to `.get()` will silently return nothing. Always use
`.where('person_id').equals(person_id).first()` instead. This has caused liveQuery `.where('person_id').equals(person_id).first()` instead. This has caused liveQuery
blocks to always produce `undefined` even when the record exists in Dexie. blocks to always produce `undefined` even when the record exists in Dexie.
9. **Treating `$effect` blocks as auth bypass risks** — a `$effect` inside a child 8. **Treating `$effect` blocks as auth bypass risks** — a `$effect` inside a child
component cannot bypass a parent `+layout.svelte` auth gate. Children only mount if component cannot bypass a parent `+layout.svelte` auth gate. Children only mount if
the parent calls `{@render children?.()}`. Adding redundant auth guards to `$effect` the parent calls `{@render children?.()}`. Adding redundant auth guards to `$effect`
blocks that can only run after the parent gate already passed is unnecessary — and blocks that can only run after the parent gate already passed is unnecessary — and
@@ -317,13 +317,13 @@ These are real incidents — know them before you start.
clean of data loads in private modules. See `GUIDE__SvelteKit2_Svelte5_DexieJS.md` clean of data loads in private modules. See `GUIDE__SvelteKit2_Svelte5_DexieJS.md`
"SvelteKit Layout Hierarchy: Security and Execution Order" for the full explanation. "SvelteKit Layout Hierarchy: Security and Execution Order" for the full explanation.
10. **Using query `key` as a proxy for bypass stripped `x-account-id`** — this caused 9. **Using query `key` as a proxy for bypass stripped `x-account-id`** — this caused
valid account-scoped requests to lose account context and 403. `key` can be a valid valid account-scoped requests to lose account context and 403. `key` can be a valid
endpoint/business param, but it is not equivalent to `x-no-account-id: bypass`. Keep endpoint/business param, but it is not equivalent to `x-no-account-id: bypass`. Keep
`x-no-account-id` usage narrow and temporary; do not expand it without a documented `x-no-account-id` usage narrow and temporary; do not expand it without a documented
allowlist case. allowlist case.
11. **Pre-stringifying `*_json` fields before passing to API wrappers** — the API wrappers 10. **Pre-stringifying `*_json` fields before passing to API wrappers** — the API wrappers
(`api_post__crud_obj.ts` for V3, `api.ts` for legacy CRUD) automatically serialize any (`api_post__crud_obj.ts` for V3, `api.ts` for legacy CRUD) automatically serialize any
field ending in `_json` (e.g. `cfg_json`, `data_json`). Pass these as plain JS objects. field ending in `_json` (e.g. `cfg_json`, `data_json`). Pass these as plain JS objects.
Pre-stringifying with `JSON.stringify()` before calling the wrapper will double-encode Pre-stringifying with `JSON.stringify()` before calling the wrapper will double-encode
@@ -331,12 +331,12 @@ These are real incidents — know them before you start.
redundant on the V3 path. Both paths now pretty-print with 2-space indent. redundant on the V3 path. Both paths now pretty-print with 2-space indent.
See `GUIDE__AE_API_V3_for_Frontend.md` → section 3C for the full explanation. See `GUIDE__AE_API_V3_for_Frontend.md` → section 3C for the full explanation.
12. **Broad Dexie result windows get silently clipped** — if a broad "All" view shows fewer 11. **Broad Dexie result windows get silently clipped** — if a broad "All" view shows fewer
rows than a narrower filter, check for a page-level limit or an API revalidation step rows than a narrower filter, check for a page-level limit or an API revalidation step
replacing the local IDB result set. For empty text searches, the full local result set replacing the local IDB result set. For empty text searches, the full local result set
should drive the display; server refreshes should update cache, not shrink visibility. should drive the display; server refreshes should update cache, not shrink visibility.
13. **Not bumping `IDB_CONTENT_VERSIONS` when changing `properties_to_save`** — this caused 12. **Not bumping `IDB_CONTENT_VERSIONS` when changing `properties_to_save`** — this caused
the IDAA Recovery Meetings "no meetings found" bug for approximately one year (20252026). the IDAA Recovery Meetings "no meetings found" bug for approximately one year (20252026).
**What happened:** A deploy changed `properties_to_save` in `ae_events__event.ts`, but no **What happened:** A deploy changed `properties_to_save` in `ae_events__event.ts`, but no
@@ -368,6 +368,35 @@ These are real incidents — know them before you start.
0 results in your templates. Silent failures look like data problems and are extremely 0 results in your templates. Silent failures look like data problems and are extremely
difficult to diagnose. difficult to diagnose.
13. **Breaking the API retry loop by returning errors instead of throwing them** — all four
`api_*_object.ts` files (`api_get_object.ts`, `api_post_object.ts`, `api_patch_object.ts`,
`api_delete_object.ts`) use a `.catch()` that returns the error as a value, followed by a
classification block. That block **must throw** for transient network failures (`TypeError`)
so they enter the retry loop. If you change it to `return false`, retries are silently
bypassed for the most common failure mode in hotel/conference WiFi — and nothing warns you.
**What happened (commit a10accfaa, Jan 2026):** A "silence background fetch noise" commit
changed `.catch()` to explicitly `return error`, then the classification block was changed
from a `throw` to `return false`. `TypeError` from `ERR_NETWORK_CHANGED` — the most common
failure on crowded WiFi — stopped retrying. The `retry_count = 5` parameter became dead
code for network errors. Went undetected for ~4 months.
**The retry classification these files must honor:**
- `TypeError` (ERR_NETWORK_CHANGED, WiFi blip) → **`throw`** → enters retry loop with backoff
- `AbortError` where `did_timeout_abort = true` (helper's own timer) → **`throw`** → retries
- `AbortError` where `did_timeout_abort = false` (navigation/unmount abort) → `return false`
- HTTP 400/401/403/422 → `return false` immediately (client errors are deterministic)
- HTTP 5xx → **`throw`** → retries with backoff
**How to verify after any change to the error block:** confirm that a `TypeError` still
produces up to 5 retry attempts with 2s→4s→6s→8s delays before returning false. A single
`return false` after the first network failure means the retry loop is broken.
**Also:** when reviewing these files, check that all four have:
- `ae_auth_error.set()` triggered on 401/403 (shows session-expired banner to the user)
- `timeout = 20000` default (was 60s in PATCH/DELETE until 2026-05-21 — 5-min worst case)
- `did_timeout_abort` flag per attempt (separates helper timeouts from caller aborts)
--- ---
## 8. Source Layout (Quick Reference) ## 8. Source Layout (Quick Reference)

View File

@@ -233,6 +233,13 @@ including timeout abort separation and capped retry backoff.
- ✅ No regression for 400/401/403/422 fail-fast behavior. - ✅ No regression for 400/401/403/422 fail-fast behavior.
-`npx svelte-check` clean, API-focused Playwright tests remained green during rollout. -`npx svelte-check` clean, API-focused Playwright tests remained green during rollout.
**Additional fixes found during review pass (2026-05-21, commit ea765d8ad):**
- PATCH + DELETE: default timeout lowered from 60s → 20s to match GET/POST. No callers set
explicit timeouts; 60s × 5 retries = 5-minute worst case before giving up.
- DELETE: added `ae_auth_error` import and session-expired banner on 401/403. All other
files (GET/POST/PATCH) trigger the banner; DELETE was missing it, causing stale-session
deletes to silently return false with no user-visible feedback.
--- ---
### [Testing] V3 API performance probe (basic stress rounds) ### [Testing] V3 API performance probe (basic stress rounds)