feat: token streaming for orchestrator final response

Switches the orchestrator's final response from a fire-and-wait model to a live SSE stream so text appears token-by-token as the model generates it.

- llm_client: complete() gains a token_sink param; the anthropic_api backend uses client.messages.stream(); the local backend uses httpx SSE streaming; non-streaming backends (claude_cli, gemini_cli) emit the full text as one chunk
- orchestrator_engine + openai_orchestrator: token_sink threaded through run(), _run_from_contents(), _claude_handoff(), and _run_from_messages()
- routers/orchestrator: each job gets an asyncio.Queue; _on_progress and _token_sink write progress/token events to it; _finalize_job emits done, the error handler emits error, and the confirmation gate emits confirm; new GET /orchestrate/{job_id}/stream SSE endpoint with a 20s keepalive
- app.js: _doOrchestrate switches from a 2s poll loop to EventSource; the thinking bubble converts to a streaming message on the first token; auto-scroll while streaming; confirm/error/done events handled; finalization unchanged

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
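
Reviewer note: the routers/orchestrator change is not part of the hunks shown here, so the following is only a minimal sketch of the per-job queue to SSE pattern the message describes, not the actual router code. The names stream_job, sse_frame, and the (event, data) tuple shape are hypothetical, as is the choice to treat done and error as the events that close the stream.

```python
import asyncio
import json
from typing import AsyncIterator

KEEPALIVE_SECONDS = 20.0  # the 20s keepalive mentioned in the commit message


def sse_frame(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame (hypothetical helper)."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def stream_job(
    queue: asyncio.Queue, keepalive: float = KEEPALIVE_SECONDS
) -> AsyncIterator[str]:
    """Drain a job's queue and yield SSE frames until a terminal event."""
    while True:
        try:
            event, data = await asyncio.wait_for(queue.get(), timeout=keepalive)
        except asyncio.TimeoutError:
            # An SSE comment line keeps idle connections from timing out.
            yield ": keepalive\n\n"
            continue
        yield sse_frame(event, data)
        if event in ("done", "error"):  # assumed terminal events
            return


async def demo() -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()

    # Stand-in for the router's _token_sink: push each chunk onto the queue.
    async def token_sink(text: str) -> None:
        await queue.put(("token", {"text": text}))

    await token_sink("Hel")
    await token_sink("lo")
    await queue.put(("done", {}))
    return [frame async for frame in stream_job(queue)]


if __name__ == "__main__":
    for frame in asyncio.run(demo()):
        print(frame, end="")
```

In a FastAPI-style router, a generator like this would typically be wrapped in a streaming response with the text/event-stream media type; whether the real endpoint does so is not visible in this diff.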
2026-06-16 23:22:50 -04:00
parent c31eba111f
commit 9cb2b0d9a5
6 changed files with 293 additions and 63 deletions
--- a/cortex/openai_orchestrator.py
+++ b/cortex/openai_orchestrator.py
@@ -53,6 +53,7 @@ async def run(
    risk_whitelist: list[str] | None = None,
    risk_blacklist: list[str] | None = None,
    on_progress=None,  # async (str) -> None; called with live status updates
+    token_sink=None,   # async (str) -> None; called with each response token
 ) -> OrchestratorResult:
    """
    Run a tool-enabled task using an OpenAI-compatible API.
@@ -119,6 +120,7 @@ async def run(
        confirm_deny=_confirm_deny,
        starting_round=0,
        on_progress=on_progress,
+        token_sink=token_sink,
    )

    if checkpoint:
@@ -310,6 +312,7 @@ async def _run_from_messages(
    starting_round: int = 0,
    tool_list: list[str] | None = None,
    on_progress=None,
+    token_sink=None,
 ) -> tuple[str, OrchestrateCheckpoint | None]:
    """
    Run the OpenAI ReAct loop from the current messages state.
@@ -425,6 +428,8 @@ async def _run_from_messages(
            if on_progress:
                await on_progress("Generating response…")
            final_response = msg.content or ""
+            if token_sink and final_response:
+                await token_sink(final_response)
            logger.info(
                "OpenAI orchestrator done after %d round(s). Tools used: %d",
                round_num + 1, len(tool_call_log),