Files
Cortex-Inara/documentation/ARCH__BACKENDS.md
Scott Idem d9a322164a feat: OpenAI-compatible orchestrator + backend auto-routing
- openai_orchestrator.py — new ReAct tool loop engine for any
  OpenAI-compatible endpoint (OpenRouter, Open WebUI, Ollama, LiteLLM);
  model handles both tool loop and final response, no Claude handoff needed
- tools/__init__.py — auto-derive OpenAI JSON Schema from existing Gemini
  FunctionDeclarations so tool definitions have a single source of truth
- routers/orchestrator.py — route to openai_orchestrator when model registry
  "orchestrator" role resolves to a local_openai type host
- routers/chat.py — pass role to _backend_label(); fix fallback_used logic
  (only meaningful for explicit backend overrides, not auto-routing)
- static/app.js — add null/"auto" to backend cycle; fetch local model hint
  without overriding the auto default on page load
- model_registry.py — _normalize() back-fills host_type on old registry files
- requirements.txt — add openai>=1.0.0
- ARCH__BACKENDS.md — document OpenAI-compat backend and routing logic

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 19:18:18 -04:00

177 lines
5.4 KiB
Markdown

# Architecture: LLM Backends
> How Cortex selects and talks to AI models.
> Last updated: 2026-04-06
---
## Backends
| Backend | Type | Auth | Notes |
|---|---|---|---|
| **Claude CLI** | `claude_cli` | OAuth token from `~/.claude/.credentials.json` | Primary chat; model set via `DEFAULT_MODEL` in `.env` |
| **Gemini CLI** | `gemini_cli` | Gemini CLI credentials | Fallback / explicit selection |
| **Gemini API** | `gemini_api` | `GEMINI_API_KEY` in `.env` | Orchestrator tool loop only — not general chat |
| **Local (OpenAI-compat)** | `local_openai` | API key per host in model registry | Open WebUI, Ollama, OpenRouter, LiteLLM, etc. |
---
## Backend Selection
### Default: Role-Based Routing (Auto)
When no explicit backend is selected, Cortex routes to the model configured for the
request's **role** in the user's model registry. Roles: `chat`, `orchestrator`, `distill`,
`coder`, `research` (extensible via `DEFINED_ROLES` in `.env`).
Resolution order for a role:
1. User registry: `roles[role].primary → backup_1 → backup_2 → backup_3 → backup_4`
2. `.env` role default: `ROLE_CHAT=claude_cli`, `ROLE_DISTILL=gemini_api`, etc.
3. Hardcoded last-resort: `chat/distill/coder → claude_cli`, `orchestrator/research → gemini_api`
### Explicit Override
The UI backend toggle cycles: **auto → claude → gemini → local → auto**
- **auto** (default): role-based routing as above; sends `model: null` to `/chat`
- **claude / gemini / local**: bypasses role routing; forces that specific backend
- When "local" is active, the configured model name appears below the toggle button
**Fallback chain** (automatic, on any error):
```
claude → gemini
gemini → claude
local → claude
```
Each response includes a model label (bottom-right of the message bubble) showing what
actually responded. Amber label with `⚡` = fallback was used.
Auth expiry on Claude triggers a UI banner + `claude_auth_expired` SSE event.
---
## Model Registry
Per-user configuration stored in `home/{user}/model_registry.json`.
Hosts and models are managed at **Settings → Model Registry** (`/settings/local`).
### Schema
```json
{
"version": 1,
"hosts": [
{
"id": "abc123",
"label": "Home ML Laptop",
"api_url": "http://192.168.x.x:3000",
"api_key": "sk-...",
"host_type": "openwebui"
}
],
"models": [
{
"id": "def456",
"type": "local_openai",
"label": "Gemma Medium",
"model_name": "agent-support-gemma-medium",
"host_id": "abc123",
"context_k": 50,
"tags": ["chat", "fast"]
}
],
"roles": {
"chat": {
"primary": "def456",
"backup_1": "claude_cli"
}
}
}
```
### host_type
Controls which API path layout is used:
| `host_type` | Chat endpoint | Models endpoint | Use for |
|---|---|---|---|
| `openwebui` (default) | `POST {url}/api/chat/completions` | `GET {url}/api/models` | Open WebUI, Ollama |
| `openai` | `POST {url}/chat/completions` | `GET {url}/models` | OpenRouter, LiteLLM, Anthropic-compat |
Set `api_url` to the base path ending just before `/chat/completions`:
- OpenRouter: `https://openrouter.ai/api/v1`
- LiteLLM proxy: `http://host:port`
### Built-in model IDs
Always resolvable without a registry entry:
| ID | Backend |
|---|---|
| `claude_cli` | Claude CLI subprocess |
| `gemini_cli` | Gemini CLI subprocess |
| `gemini_api` | Gemini API (SDK) — orchestrator only |
---
## Claude Backend (`_claude()`)
Runs `claude --print --no-session-persistence --output-format text` as a subprocess.
- System prompt passed via `--system-prompt`
- Conversation history formatted as `<conversation>` block
- Token read live from `~/.claude/.credentials.json` on every call — never relies on the
env var, which goes stale after `claude auth login`
- Model override via `--model` flag when a specific `model_name` is configured in the registry
Timeout: `TIMEOUT_CLAUDE=60` seconds (`.env`)
---
## Gemini CLI Backend (`_gemini()`)
Runs `gemini --output-format text --extensions "" -p <prompt>` as a subprocess.
- `--extensions ""` disables all MCP extensions — prevents child processes keeping pipes open
- `start_new_session=True` puts the process in its own group for clean `os.killpg` on timeout
- Output is cleaned to strip CLI noise lines (loading messages, retry notices, quota warnings)
Timeout: `TIMEOUT_GEMINI=120` seconds (`.env`)
---
## Local Backend (`_local()`)
HTTP POST to an OpenAI-compatible endpoint. Model config is resolved via the model registry.
```python
# host_type "openwebui": POST {api_url}/api/chat/completions
# host_type "openai": POST {api_url}/chat/completions
```
Timeout: `TIMEOUT_LOCAL=300` seconds (`.env`) — local models may need to load from disk.
---
## Distillation
Memory distillation uses `role="distill"` for mid and long passes. Configure the distill
model via the Model Registry → Role Assignments → Distill role.
`.env` override: `ROLE_DISTILL=claude_cli` (default). Set to any built-in ID or leave blank
to fall through to the hardcoded default (`claude_cli`).
---
## Code locations
| File | Responsibility |
|---|---|
| `cortex/llm_client.py` | `complete()` — routing, dispatch, fallback |
| `cortex/model_registry.py` | Per-user registry CRUD and resolution |
| `cortex/routers/local_llm.py` | Settings UI routes + `/api/models/role` AJAX |
| `cortex/routers/chat.py` | `_backend_label()`, `fallback_used` flag |
| `cortex/config.py` | `ROLE_*` env defaults, `DEFINED_ROLES`, `PRIMARY_BACKEND` |