feat: janitor role — session checkpoint compaction
New cortex/janitor.py runs before each orchestrator dispatch. When a session exceeds 20 user turns or ~12K estimated tokens, the oldest half is summarized by the janitor role model and replaced with a compact checkpoint message. Fail-safe: always returns original history if the model call fails. Config: JANITOR_TURN_THRESHOLD, JANITOR_TOKEN_THRESHOLD in .env. Assign Gemma E4B or Haiku 4.5 to the janitor role for effectively-free compaction. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -44,7 +44,7 @@ automatically. Remaining work is quality/reliability parity, not ground-up desig
|
||||
- [x] Retry logic on transient API errors (connection timeout, 429, 503) — 2026-05-09
|
||||
- `_chat_with_retry()` helper in `openai_orchestrator.py`; 3 attempts, exponential backoff (1s, 2s)
|
||||
- Retries on `APIConnectionError` and `APIStatusError` with status 429/500/502/503/504
|
||||
- [ ] Test end-to-end with Gemma 4 E4B and 26B A4B on scott_gaming
|
||||
- [x] Test end-to-end with Gemma 4 E4B and 26B A4B on scott_gaming — 2026-06-17
|
||||
- [ ] Review `ARCH__FUTURE.md` agent architecture ideas before finalising design
|
||||
- Reference: `docs/OPEN_WEBUI_API.md`, `documentation/ARCH__FUTURE.md` §1
|
||||
|
||||
@@ -211,43 +211,28 @@ Upload an image or document inline and have it flow into context.
|
||||
- [x] Text/code files read as UTF-8, injected as fenced code block in message
|
||||
- [x] Thumbnail/filename shown above sent message in UI
|
||||
|
||||
### [Intelligence] Session checkpoint compaction — "janitor" role
|
||||
Proactive in-session context pruning using a cheap/fast model to keep expensive
|
||||
model costs down as sessions grow. Not continuous per-token — checkpoint-triggered.
|
||||
### [Intelligence] Session checkpoint compaction — "janitor" role ✅ — 2026-06-17
|
||||
Proactive in-session context pruning using a cheap/fast model. Fires before each
|
||||
orchestrator run; compacts oldest half of history when either threshold is exceeded.
|
||||
|
||||
**Design:**
|
||||
- New `janitor` role in the model registry (alongside `chat`, `orchestrator`, `distill`)
|
||||
- Assign a cheap/fast model: Haiku 4.5, local Gemma E4B, or similar
|
||||
- Falls back to the `distill` role model if `janitor` is not configured
|
||||
- Trigger condition (either/or): session exceeds N turns (e.g. 20) OR estimated token
|
||||
count exceeds a threshold (e.g. 12K tokens of history)
|
||||
- On trigger: call janitor model with the oldest half of session history; ask it to
|
||||
write a compact "what we've established so far" summary block (3–8 sentences)
|
||||
- Replace the compacted turns with a single synthetic `assistant` message:
|
||||
`[Session checkpoint — {N} turns summarized]: {summary}`
|
||||
- The remaining recent turns stay untouched — only the stale prefix is replaced
|
||||
- Token estimate: count chars / 4 as a cheap heuristic; no exact tokenizer needed
|
||||
- [x] **`cortex/janitor.py`** — `maybe_checkpoint(session_id)` — loads session,
|
||||
checks `janitor_turn_threshold` (default 20) and `janitor_token_threshold`
|
||||
(default 12000 estimated tokens); finds a clean turn boundary; calls janitor
|
||||
role model with the oldest half; replaces compacted messages with a single
|
||||
`[Session checkpoint — N messages summarized via {backend}]` assistant message;
|
||||
fail-safe returns original messages if model call fails — 2026-06-17
|
||||
- [x] **`cortex/config.py`** — `janitor_turn_threshold`, `janitor_token_threshold`,
|
||||
`role_janitor` settings; `janitor` added to `defined_roles` — 2026-06-17
|
||||
- [x] **`cortex/routers/orchestrator.py`** — calls `janitor_checkpoint(session_id)`
|
||||
before dispatching to either orchestrator engine; no-op on new sessions — 2026-06-17
|
||||
- [x] **`model_registry.py`** — `janitor` already in `REQUIRED_ROLES`,
|
||||
`ROLE_DEFAULT_TOOLS` (no tools), and `_ROLE_LAST_RESORT` from earlier session
|
||||
|
||||
**Files to change:**
|
||||
- `model_registry.py` — add `janitor` to `ROLE_DEFAULT_TOOLS` (empty list — no tools)
|
||||
and to the roles UI in `settings/models`
|
||||
- `session_store.py` — add `maybe_checkpoint(session_id)` that checks turn count /
|
||||
estimated tokens and calls the janitor model if threshold is exceeded
|
||||
- `openai_orchestrator.py` — call `maybe_checkpoint()` at the start of each run,
|
||||
before building the active tool list and context
|
||||
- `orchestrator_engine.py` — same, before building the Gemini context
|
||||
- Settings UI — expose janitor turn/token thresholds as configurable values
|
||||
(default: 20 turns or 12K history tokens)
|
||||
**To configure:** assign Gemma E4B (local, free) or Haiku 4.5 to the `janitor` role
|
||||
in Settings → Model Registry. Thresholds overridable in `.env`:
|
||||
`JANITOR_TURN_THRESHOLD=15 JANITOR_TOKEN_THRESHOLD=8000`
|
||||
|
||||
**Economics:**
|
||||
- Haiku 4.5: ~$0.80/1M input — compacting 10K tokens costs ~$0.008
|
||||
- Saves 8–12K tokens on every subsequent Sonnet/Opus call in that session
|
||||
- Break-even after 1–2 expensive model calls post-checkpoint
|
||||
- Local janitor (Gemma E4B) = effectively free; ideal default when available
|
||||
|
||||
**Not needed yet** — most sessions are short enough that existing `_compact_messages()`
|
||||
heuristic handles the worst cases. Priority rises with dev-agent pipeline work where
|
||||
aider tool results can be very large.
|
||||
**Deferred:** Settings UI sliders for thresholds (low value — .env is sufficient)
|
||||
|
||||
### [UX] Token streaming for orchestrator final response ✅ — 2026-06-16
|
||||
Text appears token-by-token while the model is generating, instead of waiting for the
|
||||
|
||||
Reference in New Issue
Block a user