feat: token streaming for orchestrator final response

Switches the orchestrator's final response from a fire-and-wait model to a live SSE stream so text appears token-by-token as the model generates it.

- llm_client: complete() gains token_sink param; anthropic_api backend uses client.messages.stream(); local backend uses httpx SSE streaming; non-streaming backends (claude_cli, gemini_cli) emit the full text as one chunk
- orchestrator_engine + openai_orchestrator: token_sink threaded through run(), _run_from_contents(), _claude_handoff(), and _run_from_messages()
- routers/orchestrator: each job gets an asyncio.Queue; _on_progress and _token_sink write progress/token events to it; _finalize_job emits done, error handler emits error, confirmation gate emits confirm; new GET /orchestrate/{job_id}/stream SSE endpoint with 20s keepalive
- app.js: _doOrchestrate switches from 2s poll loop to EventSource; thinking bubble converts to a streaming message on first token; auto-scroll while streaming; confirm/error/done events handled; finalization unchanged

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-16 23:22:50 -04:00
parent c31eba111f
commit 9cb2b0d9a5
6 changed files with 293 additions and 63 deletions
--- a/documentation/TODO__Agents.md
+++ b/documentation/TODO__Agents.md
@@ -249,6 +249,30 @@ model costs down as sessions grow. Not continuous per-token — checkpoint-trigg
 heuristic handles the worst cases. Priority rises with dev-agent pipeline work where
 aider tool results can be very large.

+### [UX] Token streaming for orchestrator final response ✅ — 2026-06-16
+Text appears token-by-token while the model is generating, instead of waiting for the
+full response after "Generating response…" completes.
+
+- [x] **`llm_client.py`** — `complete()` gains `token_sink` param; `_dispatch()` routes to
+      streaming variants when set; `_anthropic_api_streaming()` uses `client.messages.stream()`;
+      `_local_streaming()` uses `httpx client.stream()` + SSE parsing; non-streaming backends
+      (claude_cli, gemini_cli) emit full text as one chunk via `token_sink`
+- [x] **`orchestrator_engine.py`** — `run()`, `_run_from_contents()`, and `_claude_handoff()`
+      all accept and thread `token_sink`; Gemini handoff to Claude/Anthropic API is the
+      primary streaming path
+- [x] **`openai_orchestrator.py`** — `run()` and `_run_from_messages()` accept `token_sink`;
+      local model final response emitted via `token_sink` (one chunk for now; true streaming
+      left for future polish)
+- [x] **`routers/orchestrator.py`** — each job gets an `asyncio.Queue` (`_event_queue`);
+      `_on_progress` and `_token_sink` write to the queue as events (`{type, text}`);
+      `_finalize_job` emits `{type: done, ...}`, error handler emits `{type: error, ...}`,
+      confirmation gate emits `{type: confirm, ...}`; new `GET /orchestrate/{job_id}/stream`
+      SSE endpoint with 20s keepalive timeout; handles already-complete/error jobs immediately
+- [x] **`static/app.js`** — `_doOrchestrate` switches from poll loop to `EventSource`; renders
+      thinking-bubble progress labels on `progress` events; converts bubble to streaming message
+      on first `token` event (with auto-scroll); handles `confirm`, `error`, `done` events;
+      finalization (metadata, history controls, tool calls) runs after `done`
+
 ### [Auth] Encrypted sessions
 Allow users to opt-in to per-session encryption so session logs on disk cannot be
 read without the user's key.
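
For readers skimming the hunk above, the per-job queue and SSE keepalive described in the `routers/orchestrator.py` item might look roughly like the sketch below. It is a minimal illustration using only the standard library, not the code from this commit: `format_sse`, `sse_events`, `demo`, and `fake_job` are hypothetical names, and the exact frame layout and keepalive comment text are assumptions. Only the event shape (`{type, text}`), the `done`/`error` terminal events, and the 20s keepalive come from the notes.

```python
import asyncio
import json
from typing import AsyncIterator

KEEPALIVE_SECONDS = 20.0            # the 20s keepalive mentioned in the notes
TERMINAL_TYPES = {"done", "error"}  # events after which the stream closes


def format_sse(event: dict) -> str:
    """Encode one event dict as a single SSE `data:` frame (assumed layout)."""
    return f"data: {json.dumps(event)}\n\n"


async def sse_events(
    queue: asyncio.Queue, keepalive: float = KEEPALIVE_SECONDS
) -> AsyncIterator[str]:
    """Yield SSE frames from a job's queue until a terminal event arrives."""
    while True:
        try:
            event = await asyncio.wait_for(queue.get(), timeout=keepalive)
        except asyncio.TimeoutError:
            # Lines starting with ":" are SSE comments; EventSource ignores
            # them, but they keep idle connections from being closed.
            yield ": keepalive\n\n"
            continue
        yield format_sse(event)
        if event["type"] in TERMINAL_TYPES:
            return


async def demo() -> list[str]:
    """Run a fake job that streams two tokens, then collect the frames."""
    queue: asyncio.Queue = asyncio.Queue()

    def token_sink(text: str) -> None:
        # Hypothetical sink: wraps each chunk as a token event.
        queue.put_nowait({"type": "token", "text": text})

    async def fake_job() -> None:
        await asyncio.sleep(0.1)  # idle long enough to trigger a keepalive
        for chunk in ("Hel", "lo"):
            token_sink(chunk)
        queue.put_nowait({"type": "done"})

    task = asyncio.create_task(fake_job())
    frames = [frame async for frame in sse_events(queue, keepalive=0.02)]
    await task
    return frames


if __name__ == "__main__":
    for frame in asyncio.run(demo()):
        print(repr(frame))
```

One design point worth noting: `asyncio.Queue` is not thread-safe, so if a sink is ever invoked from a worker thread, the put would need to go through `loop.call_soon_threadsafe` rather than calling `put_nowait` directly.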