feat: OpenAI-compatible orchestrator + backend auto-routing

- openai_orchestrator.py — new ReAct tool loop engine for any OpenAI-compatible endpoint (OpenRouter, Open WebUI, Ollama, LiteLLM); model handles both tool loop and final response, no Claude handoff needed - tools/__init__.py — auto-derive OpenAI JSON Schema from existing Gemini FunctionDeclarations so tool definitions have a single source of truth - routers/orchestrator.py — route to openai_orchestrator when model registry "orchestrator" role resolves to a local_openai type host - routers/chat.py — pass role to _backend_label(); fix fallback_used logic (only meaningful for explicit backend overrides, not auto-routing) - static/app.js — add null/"auto" to backend cycle; fetch local model hint without overriding the auto default on page load - model_registry.py — _normalize() back-fills host_type on old registry files - requirements.txt — add openai>=1.0.0 - ARCH__BACKENDS.md — document OpenAI-compat backend and routing logic Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 19:18:18 -04:00
parent 8ba5247ef5
commit d9a322164a
9 changed files with 538 additions and 112 deletions
--- a/documentation/ARCH__BACKENDS.md
+++ b/documentation/ARCH__BACKENDS.md
@@ -1,47 +1,130 @@
 # Architecture: LLM Backends

-> How Cortex talks to AI models.
-> Last updated: 2026-04-03
+> How Cortex selects and talks to AI models.
+> Last updated: 2026-04-06

 ---

-## Three Backends
+## Backends

-| Backend | Used for | Auth | Config |
+| Backend | Type | Auth | Notes |
 |---|---|---|---|
-| **Claude CLI** | Primary chat, all user-facing responses | OAuth token from `~/.claude/.credentials.json` | `DEFAULT_MODEL` in `.env` |
-| **Gemini CLI** | Fallback when Claude unavailable | Gemini CLI credentials | Auto-fallback |
-| **Local (Open WebUI)** | Private/offline tasks, cost-free use | API key per user in `local_llm.json` | `/settings/local` UI |
-
-The **Gemini API** (google-genai SDK) is also used — but only by the orchestrator tool loop, not as a general chat backend. See [`ARCH__FUTURE.md`](ARCH__FUTURE.md) for the orchestrator pattern.
+| **Claude CLI** | `claude_cli` | OAuth token from `~/.claude/.credentials.json` | Primary chat; model set via `DEFAULT_MODEL` in `.env` |
+| **Gemini CLI** | `gemini_cli` | Gemini CLI credentials | Fallback / explicit selection |
+| **Gemini API** | `gemini_api` | `GEMINI_API_KEY` in `.env` | Orchestrator tool loop only — not general chat |
+| **Local (OpenAI-compat)** | `local_openai` | API key per host in model registry | Open WebUI, Ollama, OpenRouter, LiteLLM, etc. |

 ---

 ## Backend Selection

-User toggles backend in the UI: `claude → gemini → local` (cycles). The active backend is stored server-side; the UI reflects it with color coding (default / green / amber).
+### Default: Role-Based Routing (Auto)

-When local is active, the active model name appears below the toggle button.
+When no explicit backend is selected, Cortex routes to the model configured for the
+request's **role** in the user's model registry. Roles: `chat`, `orchestrator`, `distill`,
+`coder`, `research` (extensible via `DEFINED_ROLES` in `.env`).

-**Fallback chain** (automatic, on error):
+Resolution order for a role:
+1. User registry: `roles[role].primary → backup_1 → backup_2 → backup_3 → backup_4`
+2. `.env` role default: `ROLE_CHAT=claude_cli`, `ROLE_DISTILL=gemini_api`, etc.
+3. Hardcoded last-resort: `chat/distill/coder → claude_cli`, `orchestrator/research → gemini_api`
+
+### Explicit Override
+
+The UI backend toggle cycles: **auto → claude → gemini → local → auto**
+
+- **auto** (default): role-based routing as above; sends `model: null` to `/chat`
+- **claude / gemini / local**: bypasses role routing; forces that specific backend
+- When "local" is active, the configured model name appears below the toggle button
+
+**Fallback chain** (automatic, on any error):
 ```
 claude  → gemini
 gemini  → claude
 local   → claude
 ```

+Each response includes a model label (bottom-right of the message bubble) showing what
+actually responded. Amber label with `⚡` = fallback was used.
+
 Auth expiry on Claude triggers a UI banner + `claude_auth_expired` SSE event.

 ---

+## Model Registry
+
+Per-user configuration stored in `home/{user}/model_registry.json`.
+
+Hosts and models are managed at **Settings → Model Registry** (`/settings/local`).
+
+### Schema
+
+```json
+{
+  "version": 1,
+  "hosts": [
+    {
+      "id": "abc123",
+      "label": "Home ML Laptop",
+      "api_url": "http://192.168.x.x:3000",
+      "api_key": "sk-...",
+      "host_type": "openwebui"
+    }
+  ],
+  "models": [
+    {
+      "id": "def456",
+      "type": "local_openai",
+      "label": "Gemma Medium",
+      "model_name": "agent-support-gemma-medium",
+      "host_id": "abc123",
+      "context_k": 50,
+      "tags": ["chat", "fast"]
+    }
+  ],
+  "roles": {
+    "chat": {
+      "primary": "def456",
+      "backup_1": "claude_cli"
+    }
+  }
+}
+```
+
+### host_type
+
+Controls which API path layout is used:
+
+| `host_type` | Chat endpoint | Models endpoint | Use for |
+|---|---|---|---|
+| `openwebui` (default) | `POST {url}/api/chat/completions` | `GET {url}/api/models` | Open WebUI, Ollama |
+| `openai` | `POST {url}/chat/completions` | `GET {url}/models` | OpenRouter, LiteLLM, Anthropic-compat |
+
+Set `api_url` to the base path ending just before `/chat/completions`:
+- OpenRouter: `https://openrouter.ai/api/v1`
+- LiteLLM proxy: `http://host:port`
+
+### Built-in model IDs
+
+Always resolvable without a registry entry:
+
+| ID | Backend |
+|---|---|
+| `claude_cli` | Claude CLI subprocess |
+| `gemini_cli` | Gemini CLI subprocess |
+| `gemini_api` | Gemini API (SDK) — orchestrator only |
+
+---
+
 ## Claude Backend (`_claude()`)

 Runs `claude --print --no-session-persistence --output-format text` as a subprocess.

 - System prompt passed via `--system-prompt`
 - Conversation history formatted as `<conversation>` block
- Token read live from `~/.claude/.credentials.json` on every call — never relies on the env var, which goes stale after `claude auth login`
- Model override via `--model` flag (e.g. `claude-opus-4-6`)
+- Token read live from `~/.claude/.credentials.json` on every call — never relies on the
+  env var, which goes stale after `claude auth login`
+- Model override via `--model` flag when a specific `model_name` is configured in the registry

 Timeout: `TIMEOUT_CLAUDE=60` seconds (`.env`)

@@ -51,7 +134,7 @@ Timeout: `TIMEOUT_CLAUDE=60` seconds (`.env`)

 Runs `gemini --output-format text --extensions "" -p <prompt>` as a subprocess.

- `--extensions ""` disables all MCP extensions — prevents child processes from keeping pipes open after responding
+- `--extensions ""` disables all MCP extensions — prevents child processes keeping pipes open
 - `start_new_session=True` puts the process in its own group for clean `os.killpg` on timeout
 - Output is cleaned to strip CLI noise lines (loading messages, retry notices, quota warnings)

@@ -61,46 +144,33 @@ Timeout: `TIMEOUT_GEMINI=120` seconds (`.env`)

 ## Local Backend (`_local()`)

-HTTP POST to Open WebUI's OpenAI-compatible endpoint: `{api_url}/api/chat/completions`.
+HTTP POST to an OpenAI-compatible endpoint. Model config is resolved via the model registry.

-Per-user config in `home/{user}/local_llm.json`:
-```json
-{
-  "hosts": [{"id": "...", "label": "scott_gaming", "api_url": "http://192.168.32.19:3000", "api_key": "sk-..."}],
-  "models": [{"id": "...", "host_id": "...", "label": "Gemma 4 Small", "model_name": "agent-support-gemma-small"}],
-  "active_model_id": "..."
-}
+```python
+# host_type "openwebui":  POST {api_url}/api/chat/completions
+# host_type "openai":     POST {api_url}/chat/completions
 ```

-Resolution order for active model:
-1. User's `active_model_id` in `local_llm.json`
-2. `.env` server defaults (`LOCAL_API_URL` / `LOCAL_MODEL`)
-3. Error — user is prompted to configure at `/settings/local`
-
 Timeout: `TIMEOUT_LOCAL=300` seconds (`.env`) — local models may need to load from disk.

-**Manage at:** `/settings/local` — supports multiple hosts and models per user, "Fetch from host" button to populate model list from the server.
+---
+
+## Distillation
+
+Memory distillation uses `role="distill"` for mid and long passes. Configure the distill
+model via the Model Registry → Role Assignments → Distill role.
+
+`.env` override: `ROLE_DISTILL=claude_cli` (default). Set to any built-in ID or leave blank
+to fall through to the hardcoded default (`claude_cli`).

 ---

-## Distillation Backends
+## Code locations

-Memory distillation runs on a schedule and uses the LLM for mid and long distill passes. By default uses the primary backend (`claude`). Override in `.env`:
-
-```
-DISTILL_BACKEND_MID=local   # saves API credits — Gemma handles summarization well
-DISTILL_BACKEND_LONG=       # empty = use primary (claude recommended for quality)
-```
-
---
-
-## Current Local Models (scott_gaming, 8 GB VRAM)
-
-| Model | Alias | Speed | Practical Context |
-|---|---|---|---|
-| Gemma 4 E4B | `agent-support-gemma-small` | ~25 t/s | **72k tokens** |
-| Gemma 4 26B A4B (MoE) | `agent-support-gemma-medium` | ~9 t/s | **50k tokens** |
-
-Both support OpenAI `tools` / `tool_choice` function calling — required for the local orchestrator.
-
-Full Open WebUI API reference: [`docs/OPEN_WEBUI_API.md`](../docs/OPEN_WEBUI_API.md)
+| File | Responsibility |
+|---|---|
+| `cortex/llm_client.py` | `complete()` — routing, dispatch, fallback |
+| `cortex/model_registry.py` | Per-user registry CRUD and resolution |
+| `cortex/routers/local_llm.py` | Settings UI routes + `/api/models/role` AJAX |
+| `cortex/routers/chat.py` | `_backend_label()`, `fallback_used` flag |
+| `cortex/config.py` | `ROLE_*` env defaults, `DEFINED_ROLES`, `PRIMARY_BACKEND` |