feat: token streaming for orchestrator final response

Switches the orchestrator's final response from a fire-and-wait model to a
live SSE stream so text appears token-by-token as the model generates it.

- llm_client: complete() gains token_sink param; anthropic_api backend uses
  client.messages.stream(); local backend uses httpx SSE streaming; non-streaming
  backends (claude_cli, gemini_cli) emit the full text as one chunk
- orchestrator_engine + openai_orchestrator: token_sink threaded through run(),
  _run_from_contents(), _claude_handoff(), and _run_from_messages()
- routers/orchestrator: each job gets an asyncio.Queue; _on_progress and
  _token_sink write progress/token events to it; _finalize_job emits done,
  error handler emits error, confirmation gate emits confirm; new GET
  /orchestrate/{job_id}/stream SSE endpoint with 20s keepalive
- app.js: _doOrchestrate switches from 2s poll loop to EventSource; thinking
  bubble converts to a streaming message on first token; auto-scroll while
  streaming; confirm/error/done events handled; finalization unchanged

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Scott Idem
2026-06-16 23:22:50 -04:00
parent c31eba111f
commit 9cb2b0d9a5
6 changed files with 293 additions and 63 deletions

@@ -249,6 +249,30 @@ model costs down as sessions grow. Not continuous per-token — checkpoint-trigg
heuristic handles the worst cases. Priority rises with dev-agent pipeline work where
aider tool results can be very large.
### [UX] Token streaming for orchestrator final response ✅ — 2026-06-16
Text now appears token-by-token as the model generates it, instead of arriving all at
once after the "Generating response…" indicator finishes.
- [x] **`llm_client.py`** — `complete()` gains `token_sink` param; `_dispatch()` routes to
streaming variants when set; `_anthropic_api_streaming()` uses `client.messages.stream()`;
`_local_streaming()` uses `httpx`'s `client.stream()` + SSE parsing; non-streaming backends
(claude_cli, gemini_cli) emit full text as one chunk via `token_sink`
- [x] **`orchestrator_engine.py`** — `run()`, `_run_from_contents()`, and `_claude_handoff()`
all accept and thread `token_sink`; Gemini handoff to Claude/Anthropic API is the
primary streaming path
- [x] **`openai_orchestrator.py`** — `run()` and `_run_from_messages()` accept `token_sink`;
local model final response emitted via `token_sink` (one chunk for now; true streaming
left for future polish)
- [x] **`routers/orchestrator.py`** — each job gets an `asyncio.Queue` (`_event_queue`);
`_on_progress` and `_token_sink` write to the queue as events (`{type, text}`);
`_finalize_job` emits `{type: done, ...}`, error handler emits `{type: error, ...}`,
confirmation gate emits `{type: confirm, ...}`; new `GET /orchestrate/{job_id}/stream`
SSE endpoint with 20s keepalive timeout; handles already-complete/error jobs immediately
- [x] **`static/app.js`** — `_doOrchestrate` switches from poll loop to `EventSource`; renders
thinking-bubble progress labels on `progress` events; converts bubble to streaming message
on first `token` event (with auto-scroll); handles `confirm`, `error`, `done` events;
finalization (metadata, history controls, tool calls) runs after `done`
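The queue-to-SSE flow in the items above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this commit: `complete_non_streaming`, `format_sse`, `sse_events`, and `demo` are hypothetical names, the `data:` + JSON framing is assumed, and treating `done` and `error` as the only terminal events is also an assumption. Only the standard library is used, so no web framework wiring is shown.

```python
import asyncio
import json
from typing import AsyncIterator, Callable, List, Optional

KEEPALIVE_SECONDS = 20.0  # mirrors the 20s keepalive mentioned above


def complete_non_streaming(
    prompt: str, token_sink: Optional[Callable[[str], None]] = None
) -> str:
    """Hypothetical stand-in for a non-streaming backend.

    It produces the whole response at once and, if a sink is given,
    emits that full text as a single chunk.
    """
    text = f"echo: {prompt}"
    if token_sink is not None:
        token_sink(text)
    return text


def format_sse(event: dict) -> str:
    """Serialize one event as an SSE `data:` frame (assumed framing)."""
    return f"data: {json.dumps(event)}\n\n"


async def sse_events(
    queue: asyncio.Queue, keepalive: float = KEEPALIVE_SECONDS
) -> AsyncIterator[str]:
    """Drain a per-job queue into SSE frames.

    If no event arrives within `keepalive` seconds, an SSE comment line is
    yielded so the connection stays open. The stream ends after a `done`
    or `error` event (an assumption made for this sketch).
    """
    while True:
        try:
            event = await asyncio.wait_for(queue.get(), timeout=keepalive)
        except asyncio.TimeoutError:
            yield ": keepalive\n\n"
            continue
        yield format_sse(event)
        if event.get("type") in ("done", "error"):
            return


async def demo() -> List[str]:
    """Push progress, token, and done events, then collect the frames."""
    queue: asyncio.Queue = asyncio.Queue()

    def token_sink(text: str) -> None:
        # Same-loop producer; a threaded producer would need
        # loop.call_soon_threadsafe instead of put_nowait.
        queue.put_nowait({"type": "token", "text": text})

    queue.put_nowait({"type": "progress", "text": "thinking"})
    complete_non_streaming("hi", token_sink)
    queue.put_nowait({"type": "done"})
    return [frame async for frame in sse_events(queue, keepalive=0.05)]


if __name__ == "__main__":
    for frame in asyncio.run(demo()):
        print(frame, end="")
```

On the browser side, an `EventSource` would receive each `data:` frame as a message and ignore the comment-style keepalive lines, which is why the keepalive is written as a line starting with `:`.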
### [Auth] Encrypted sessions
Allow users to opt-in to per-session encryption so session logs on disk cannot be
read without the user's key.