feat: OpenAI-compatible orchestrator + backend auto-routing

- openai_orchestrator.py — new ReAct tool loop engine for any
  OpenAI-compatible endpoint (OpenRouter, Open WebUI, Ollama, LiteLLM);
  model handles both tool loop and final response, no Claude handoff needed
- tools/__init__.py — auto-derive OpenAI JSON Schema from existing Gemini
  FunctionDeclarations so tool definitions have a single source of truth
- routers/orchestrator.py — route to openai_orchestrator when model registry
  "orchestrator" role resolves to a local_openai type host
- routers/chat.py — pass role to _backend_label(); fix fallback_used logic
  (only meaningful for explicit backend overrides, not auto-routing)
- static/app.js — add null/"auto" to backend cycle; fetch local model hint
  without overriding the auto default on page load
- model_registry.py — _normalize() back-fills host_type on old registry files
- requirements.txt — add openai>=1.0.0
- ARCH__BACKENDS.md — document OpenAI-compat backend and routing logic

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Scott Idem
2026-04-08 19:18:18 -04:00
parent 8ba5247ef5
commit d9a322164a
9 changed files with 538 additions and 112 deletions


@@ -1,47 +1,130 @@
# Architecture: LLM Backends
> How Cortex talks to AI models.
> Last updated: 2026-04-03
> How Cortex selects and talks to AI models.
> Last updated: 2026-04-06
---
## Three Backends
## Backends
| Backend | Used for | Auth | Config |
| Backend | Type | Auth | Notes |
|---|---|---|---|
| **Claude CLI** | Primary chat, all user-facing responses | OAuth token from `~/.claude/.credentials.json` | `DEFAULT_MODEL` in `.env` |
| **Gemini CLI** | Fallback when Claude unavailable | Gemini CLI credentials | Auto-fallback |
| **Local (Open WebUI)** | Private/offline tasks, cost-free use | API key per user in `local_llm.json` | `/settings/local` UI |
The **Gemini API** (google-genai SDK) is also used — but only by the orchestrator tool loop, not as a general chat backend. See [`ARCH__FUTURE.md`](ARCH__FUTURE.md) for the orchestrator pattern.
| **Claude CLI** | `claude_cli` | OAuth token from `~/.claude/.credentials.json` | Primary chat; model set via `DEFAULT_MODEL` in `.env` |
| **Gemini CLI** | `gemini_cli` | Gemini CLI credentials | Fallback / explicit selection |
| **Gemini API** | `gemini_api` | `GEMINI_API_KEY` in `.env` | Orchestrator tool loop only — not general chat |
| **Local (OpenAI-compat)** | `local_openai` | API key per host in model registry | Open WebUI, Ollama, OpenRouter, LiteLLM, etc. |
---
## Backend Selection
User toggles backend in the UI: `claude → gemini → local` (cycles). The active backend is stored server-side; the UI reflects it with color coding (default / green / amber).
### Default: Role-Based Routing (Auto)
When local is active, the active model name appears below the toggle button.
When no explicit backend is selected, Cortex routes to the model configured for the
request's **role** in the user's model registry. Roles: `chat`, `orchestrator`, `distill`,
`coder`, `research` (extensible via `DEFINED_ROLES` in `.env`).
**Fallback chain** (automatic, on error):
Resolution order for a role:
1. User registry: `roles[role].primary → backup_1 → backup_2 → backup_3 → backup_4`
2. `.env` role default: `ROLE_CHAT=claude_cli`, `ROLE_DISTILL=claude_cli`, etc.
3. Hardcoded last-resort: `chat/distill/coder → claude_cli`, `orchestrator/research → gemini_api`
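
The resolution order above can be sketched as a function that returns candidate model IDs in priority order. This is a minimal illustration, not the code in `model_registry.py`: the function name `role_candidates`, the `env` dict argument, and the de-duplication step are assumptions, and the sketch does not check whether a candidate is actually reachable.

```python
from typing import Dict, List

# Slot order within a role, as listed in step 1.
SLOTS = ("primary", "backup_1", "backup_2", "backup_3", "backup_4")

# Hardcoded last-resort defaults from step 3.
HARDCODED = {
    "chat": "claude_cli",
    "distill": "claude_cli",
    "coder": "claude_cli",
    "orchestrator": "gemini_api",
    "research": "gemini_api",
}


def role_candidates(role: str, registry: dict, env: Dict[str, str]) -> List[str]:
    """Return model IDs for a role in resolution order (hypothetical helper)."""
    candidates: List[str] = []

    # 1. User registry: primary, then backup_1 .. backup_4 (empty slots skipped).
    slots = registry.get("roles", {}).get(role, {})
    candidates += [slots[s] for s in SLOTS if slots.get(s)]

    # 2. .env role default, e.g. ROLE_CHAT=claude_cli.
    env_default = env.get(f"ROLE_{role.upper()}", "").strip()
    if env_default:
        candidates.append(env_default)

    # 3. Hardcoded last resort (only defined for the five stock roles).
    if role in HARDCODED:
        candidates.append(HARDCODED[role])

    # Drop repeats while keeping the first occurrence of each ID.
    return list(dict.fromkeys(candidates))
```

A blank `.env` value falls through to step 3, so a role with no registry entry and no env default still resolves to one built-in ID.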
### Explicit Override
The UI backend toggle cycles: **auto → claude → gemini → local → auto**
- **auto** (default): role-based routing as above; sends `model: null` to `/chat`
- **claude / gemini / local**: bypasses role routing; forces that specific backend
- When "local" is active, the configured model name appears below the toggle button
**Fallback chain** (automatic, on any error):
```
claude → gemini
gemini → claude
local → claude
```
Each response includes a model label (bottom-right of the message bubble) showing which
model actually responded. An amber label with `⚡` means a fallback was used.
Auth expiry on Claude triggers a UI banner + `claude_auth_expired` SSE event.
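
A minimal sketch of the fallback chain, assuming a single fallback hop per request and a plain mapping from backend name to callable. The names `complete_with_fallback` and `backends` are hypothetical; the real dispatch lives in `complete()` and may differ.

```python
from typing import Callable, Dict, Tuple

# Fallback targets exactly as listed in the chain above.
FALLBACK = {"claude": "gemini", "gemini": "claude", "local": "claude"}


def complete_with_fallback(
    backend: str,
    prompt: str,
    backends: Dict[str, Callable[[str], str]],
) -> Tuple[str, str, bool]:
    """Return (text, backend_that_answered, fallback_used)."""
    try:
        return backends[backend](prompt), backend, False
    except Exception:
        # Any error triggers one hop to the mapped fallback backend.
        # If that backend fails too, its exception propagates to the caller.
        fallback = FALLBACK[backend]
        return backends[fallback](prompt), fallback, True
```

The third element of the result corresponds to the condition under which the UI would show the amber `⚡` label.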
---
## Model Registry
Per-user configuration stored in `home/{user}/model_registry.json`.
Hosts and models are managed at **Settings → Model Registry** (`/settings/local`).
### Schema
```json
{
"version": 1,
"hosts": [
{
"id": "abc123",
"label": "Home ML Laptop",
"api_url": "http://192.168.x.x:3000",
"api_key": "sk-...",
"host_type": "openwebui"
}
],
"models": [
{
"id": "def456",
"type": "local_openai",
"label": "Gemma Medium",
"model_name": "agent-support-gemma-medium",
"host_id": "abc123",
"context_k": 50,
"tags": ["chat", "fast"]
}
],
"roles": {
"chat": {
"primary": "def456",
"backup_1": "claude_cli"
}
}
}
```
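
As an illustration of how this schema might be consumed, the sketch below back-fills a missing `host_type` with the documented default (`openwebui`) and joins a model entry to its host. Both helper names are hypothetical and are not the `_normalize()` implementation; the sketch assumes the registry has already been loaded from JSON into a dict.

```python
import copy

DEFAULT_HOST_TYPE = "openwebui"  # documented default for host_type


def backfill_host_type(registry: dict) -> dict:
    """Return a copy of the registry where every host has a host_type."""
    normalized = copy.deepcopy(registry)
    for host in normalized.get("hosts", []):
        # Older registry files predate host_type; only fill it when absent.
        host.setdefault("host_type", DEFAULT_HOST_TYPE)
    return normalized


def host_for_model(registry: dict, model_id: str) -> dict:
    """Look up the host entry a registry model points at via host_id."""
    models = {m["id"]: m for m in registry.get("models", [])}
    hosts = {h["id"]: h for h in registry.get("hosts", [])}
    return hosts[models[model_id]["host_id"]]
```

Working on a deep copy keeps the caller's dict unchanged, which makes it easy to compare the file on disk with the normalized form before writing it back.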
### host_type
Controls which API path layout is used:
| `host_type` | Chat endpoint | Models endpoint | Use for |
|---|---|---|---|
| `openwebui` (default) | `POST {url}/api/chat/completions` | `GET {url}/api/models` | Open WebUI, Ollama |
| `openai` | `POST {url}/chat/completions` | `GET {url}/models` | OpenRouter, LiteLLM, Anthropic-compat |
Set `api_url` to the base path ending just before `/chat/completions`:
- OpenRouter: `https://openrouter.ai/api/v1`
- LiteLLM proxy: `http://host:port`
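
The table above maps directly onto a small lookup. The following sketch builds both endpoint URLs from `api_url` and `host_type`; the function name `endpoints` and the trailing-slash handling are assumptions added for illustration.

```python
from typing import Tuple

# (chat path, models path) per host_type, taken from the table above.
PATHS = {
    "openwebui": ("/api/chat/completions", "/api/models"),
    "openai": ("/chat/completions", "/models"),
}


def endpoints(api_url: str, host_type: str = "openwebui") -> Tuple[str, str]:
    """Return (chat_url, models_url) for a host."""
    chat_path, models_path = PATHS[host_type]
    base = api_url.rstrip("/")  # tolerate a trailing slash in api_url
    return base + chat_path, base + models_path
```

For example, `endpoints("https://openrouter.ai/api/v1", "openai")` yields `https://openrouter.ai/api/v1/chat/completions` as the chat URL.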
### Built-in model IDs
Always resolvable without a registry entry:
| ID | Backend |
|---|---|
| `claude_cli` | Claude CLI subprocess |
| `gemini_cli` | Gemini CLI subprocess |
| `gemini_api` | Gemini API (SDK) — orchestrator only |
---
## Claude Backend (`_claude()`)
Runs `claude --print --no-session-persistence --output-format text` as a subprocess.
- System prompt passed via `--system-prompt`
- Conversation history formatted as `<conversation>` block
- Token read live from `~/.claude/.credentials.json` on every call — never relies on the env var, which goes stale after `claude auth login`
- Model override via `--model` flag (e.g. `claude-opus-4-6`)
- Token read live from `~/.claude/.credentials.json` on every call — never relies on the
env var, which goes stale after `claude auth login`
- Model override via `--model` flag when a specific `model_name` is configured in the registry
Timeout: `TIMEOUT_CLAUDE=60` seconds (`.env`)
@@ -51,7 +134,7 @@ Timeout: `TIMEOUT_CLAUDE=60` seconds (`.env`)
Runs `gemini --output-format text --extensions "" -p <prompt>` as a subprocess.
- `--extensions ""` disables all MCP extensions — prevents child processes from keeping pipes open after responding
- `--extensions ""` disables all MCP extensions, which prevents child processes from keeping pipes open
- `start_new_session=True` puts the process in its own group for clean `os.killpg` on timeout
- Output is cleaned to strip CLI noise lines (loading messages, retry notices, quota warnings)
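
The process-group handling described above can be sketched as follows. This is a POSIX-only illustration with a hypothetical helper name (`run_in_own_group`); it is not the `_gemini()` implementation and omits the output-cleaning step.

```python
import os
import signal
import subprocess
from typing import List


def run_in_own_group(argv: List[str], timeout: float) -> str:
    """Run a command in its own session; kill the whole group on timeout."""
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        start_new_session=True,  # child becomes leader of a new process group
    )
    try:
        stdout, _ = proc.communicate(timeout=timeout)
        return stdout
    except subprocess.TimeoutExpired:
        # Kill the child and anything it spawned, so no descendant keeps
        # the stdout/stderr pipes open.
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        proc.communicate()  # reap the process and drain the pipes
        raise
```

Under these assumptions the Gemini call would look like `run_in_own_group(["gemini", "--output-format", "text", "--extensions", "", "-p", prompt], timeout=120)`, with the caller catching `subprocess.TimeoutExpired`.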
@@ -61,46 +144,33 @@ Timeout: `TIMEOUT_GEMINI=120` seconds (`.env`)
## Local Backend (`_local()`)
HTTP POST to Open WebUI's OpenAI-compatible endpoint: `{api_url}/api/chat/completions`.
HTTP POST to an OpenAI-compatible endpoint. Model config is resolved via the model registry.
Per-user config in `home/{user}/local_llm.json`:
```json
{
"hosts": [{"id": "...", "label": "scott_gaming", "api_url": "http://192.168.32.19:3000", "api_key": "sk-..."}],
"models": [{"id": "...", "host_id": "...", "label": "Gemma 4 Small", "model_name": "agent-support-gemma-small"}],
"active_model_id": "..."
}
```python
# host_type "openwebui": POST {api_url}/api/chat/completions
# host_type "openai": POST {api_url}/chat/completions
```
Resolution order for active model:
1. User's `active_model_id` in `local_llm.json`
2. `.env` server defaults (`LOCAL_API_URL` / `LOCAL_MODEL`)
3. Error — user is prompted to configure at `/settings/local`
Timeout: `TIMEOUT_LOCAL=300` seconds (`.env`) — local models may need to load from disk.
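
A minimal sketch of the request the local backend sends, assuming the standard OpenAI chat-completions payload and Bearer-token auth. The helper name `build_chat_request` is hypothetical, and the real `_local()` may use a different HTTP client or extra fields.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(
    chat_url: str,
    api_key: str,
    model_name: str,
    messages: List[Dict[str, str]],
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({"model": model_name, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        chat_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
    )
```

Sending it with `urllib.request.urlopen(req, timeout=300)` would mirror the `TIMEOUT_LOCAL` value above.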
**Manage at:** `/settings/local` — supports multiple hosts and models per user, "Fetch from host" button to populate model list from the server.
---
## Distillation
Memory distillation uses `role="distill"` for mid and long passes. Configure the distill
model via the Model Registry → Role Assignments → Distill role.
`.env` override: `ROLE_DISTILL=claude_cli` (default). Set to any built-in ID or leave blank
to fall through to the hardcoded default (`claude_cli`).
---
## Distillation Backends
## Code locations
Memory distillation runs on a schedule and uses the LLM for mid and long distill passes. By default uses the primary backend (`claude`). Override in `.env`:
```
DISTILL_BACKEND_MID=local # saves API credits — Gemma handles summarization well
DISTILL_BACKEND_LONG= # empty = use primary (claude recommended for quality)
```
---
## Current Local Models (scott_gaming, 8 GB VRAM)
| Model | Alias | Speed | Practical Context |
|---|---|---|---|
| Gemma 4 E4B | `agent-support-gemma-small` | ~25 t/s | **72k tokens** |
| Gemma 4 26B A4B (MoE) | `agent-support-gemma-medium` | ~9 t/s | **50k tokens** |
Both support OpenAI `tools` / `tool_choice` function calling — required for the local orchestrator.
Full Open WebUI API reference: [`docs/OPEN_WEBUI_API.md`](../docs/OPEN_WEBUI_API.md)
| File | Responsibility |
|---|---|
| `cortex/llm_client.py` | `complete()` — routing, dispatch, fallback |
| `cortex/model_registry.py` | Per-user registry CRUD and resolution |
| `cortex/routers/local_llm.py` | Settings UI routes + `/api/models/role` AJAX |
| `cortex/routers/chat.py` | `_backend_label()`, `fallback_used` flag |
| `cortex/config.py` | `ROLE_*` env defaults, `DEFINED_ROLES`, `PRIMARY_BACKEND` |