- openai_orchestrator.py — new ReAct tool loop engine for any OpenAI-compatible endpoint (OpenRouter, Open WebUI, Ollama, LiteLLM); model handles both tool loop and final response, no Claude handoff needed - tools/__init__.py — auto-derive OpenAI JSON Schema from existing Gemini FunctionDeclarations so tool definitions have a single source of truth - routers/orchestrator.py — route to openai_orchestrator when model registry "orchestrator" role resolves to a local_openai type host - routers/chat.py — pass role to _backend_label(); fix fallback_used logic (only meaningful for explicit backend overrides, not auto-routing) - static/app.js — add null/"auto" to backend cycle; fetch local model hint without overriding the auto default on page load - model_registry.py — _normalize() back-fills host_type on old registry files - requirements.txt — add openai>=1.0.0 - ARCH__BACKENDS.md — document OpenAI-compat backend and routing logic Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
177 lines
5.4 KiB
Markdown
177 lines
5.4 KiB
Markdown
# Architecture: LLM Backends
|
|
|
|
> How Cortex selects and talks to AI models.
|
|
> Last updated: 2026-04-06
|
|
|
|
---
|
|
|
|
## Backends
|
|
|
|
| Backend | Type | Auth | Notes |
|
|
|---|---|---|---|
|
|
| **Claude CLI** | `claude_cli` | OAuth token from `~/.claude/.credentials.json` | Primary chat; model set via `DEFAULT_MODEL` in `.env` |
|
|
| **Gemini CLI** | `gemini_cli` | Gemini CLI credentials | Fallback / explicit selection |
|
|
| **Gemini API** | `gemini_api` | `GEMINI_API_KEY` in `.env` | Orchestrator tool loop only — not general chat |
|
|
| **Local (OpenAI-compat)** | `local_openai` | API key per host in model registry | Open WebUI, Ollama, OpenRouter, LiteLLM, etc. |
|
|
|
|
---
|
|
|
|
## Backend Selection
|
|
|
|
### Default: Role-Based Routing (Auto)
|
|
|
|
When no explicit backend is selected, Cortex routes to the model configured for the
|
|
request's **role** in the user's model registry. Roles: `chat`, `orchestrator`, `distill`,
|
|
`coder`, `research` (extensible via `DEFINED_ROLES` in `.env`).
|
|
|
|
Resolution order for a role:
|
|
1. User registry: `roles[role].primary → backup_1 → backup_2 → backup_3 → backup_4`
|
|
2. `.env` role default: `ROLE_CHAT=claude_cli`, `ROLE_DISTILL=gemini_api`, etc.
|
|
3. Hardcoded last-resort: `chat/distill/coder → claude_cli`, `orchestrator/research → gemini_api`
|
|
|
|
### Explicit Override
|
|
|
|
The UI backend toggle cycles: **auto → claude → gemini → local → auto**
|
|
|
|
- **auto** (default): role-based routing as above; sends `model: null` to `/chat`
|
|
- **claude / gemini / local**: bypasses role routing; forces that specific backend
|
|
- When "local" is active, the configured model name appears below the toggle button
|
|
|
|
**Fallback chain** (automatic, on any error):
|
|
```
|
|
claude → gemini
|
|
gemini → claude
|
|
local → claude
|
|
```
|
|
|
|
Each response includes a model label (bottom-right of the message bubble) showing what
|
|
actually responded. Amber label with `⚡` = fallback was used.
|
|
|
|
Auth expiry on Claude triggers a UI banner + `claude_auth_expired` SSE event.
|
|
|
|
---
|
|
|
|
## Model Registry
|
|
|
|
Per-user configuration stored in `home/{user}/model_registry.json`.
|
|
|
|
Hosts and models are managed at **Settings → Model Registry** (`/settings/local`).
|
|
|
|
### Schema
|
|
|
|
```json
|
|
{
|
|
"version": 1,
|
|
"hosts": [
|
|
{
|
|
"id": "abc123",
|
|
"label": "Home ML Laptop",
|
|
"api_url": "http://192.168.x.x:3000",
|
|
"api_key": "sk-...",
|
|
"host_type": "openwebui"
|
|
}
|
|
],
|
|
"models": [
|
|
{
|
|
"id": "def456",
|
|
"type": "local_openai",
|
|
"label": "Gemma Medium",
|
|
"model_name": "agent-support-gemma-medium",
|
|
"host_id": "abc123",
|
|
"context_k": 50,
|
|
"tags": ["chat", "fast"]
|
|
}
|
|
],
|
|
"roles": {
|
|
"chat": {
|
|
"primary": "def456",
|
|
"backup_1": "claude_cli"
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
### host_type
|
|
|
|
Controls which API path layout is used:
|
|
|
|
| `host_type` | Chat endpoint | Models endpoint | Use for |
|
|
|---|---|---|---|
|
|
| `openwebui` (default) | `POST {url}/api/chat/completions` | `GET {url}/api/models` | Open WebUI, Ollama |
|
|
| `openai` | `POST {url}/chat/completions` | `GET {url}/models` | OpenRouter, LiteLLM, Anthropic-compat |
|
|
|
|
Set `api_url` to the base path ending just before `/chat/completions`:
|
|
- OpenRouter: `https://openrouter.ai/api/v1`
|
|
- LiteLLM proxy: `http://host:port`
|
|
|
|
### Built-in model IDs
|
|
|
|
Always resolvable without a registry entry:
|
|
|
|
| ID | Backend |
|
|
|---|---|
|
|
| `claude_cli` | Claude CLI subprocess |
|
|
| `gemini_cli` | Gemini CLI subprocess |
|
|
| `gemini_api` | Gemini API (SDK) — orchestrator only |
|
|
|
|
---
|
|
|
|
## Claude Backend (`_claude()`)
|
|
|
|
Runs `claude --print --no-session-persistence --output-format text` as a subprocess.
|
|
|
|
- System prompt passed via `--system-prompt`
|
|
- Conversation history formatted as `<conversation>` block
|
|
- Token read live from `~/.claude/.credentials.json` on every call — never relies on the
|
|
env var, which goes stale after `claude auth login`
|
|
- Model override via `--model` flag when a specific `model_name` is configured in the registry
|
|
|
|
Timeout: `TIMEOUT_CLAUDE=60` seconds (`.env`)
|
|
|
|
---
|
|
|
|
## Gemini CLI Backend (`_gemini()`)
|
|
|
|
Runs `gemini --output-format text --extensions "" -p <prompt>` as a subprocess.
|
|
|
|
- `--extensions ""` disables all MCP extensions — prevents child processes keeping pipes open
|
|
- `start_new_session=True` puts the process in its own group for clean `os.killpg` on timeout
|
|
- Output is cleaned to strip CLI noise lines (loading messages, retry notices, quota warnings)
|
|
|
|
Timeout: `TIMEOUT_GEMINI=120` seconds (`.env`)
|
|
|
|
---
|
|
|
|
## Local Backend (`_local()`)
|
|
|
|
HTTP POST to an OpenAI-compatible endpoint. Model config is resolved via the model registry.
|
|
|
|
```python
|
|
# host_type "openwebui": POST {api_url}/api/chat/completions
|
|
# host_type "openai": POST {api_url}/chat/completions
|
|
```
|
|
|
|
Timeout: `TIMEOUT_LOCAL=300` seconds (`.env`) — local models may need to load from disk.
|
|
|
|
---
|
|
|
|
## Distillation
|
|
|
|
Memory distillation uses `role="distill"` for mid and long passes. Configure the distill
|
|
model via the Model Registry → Role Assignments → Distill role.
|
|
|
|
`.env` override: `ROLE_DISTILL=claude_cli` (default). Set to any built-in ID or leave blank
|
|
to fall through to the hardcoded default (`claude_cli`).
|
|
|
|
---
|
|
|
|
## Code locations
|
|
|
|
| File | Responsibility |
|
|
|---|---|
|
|
| `cortex/llm_client.py` | `complete()` — routing, dispatch, fallback |
|
|
| `cortex/model_registry.py` | Per-user registry CRUD and resolution |
|
|
| `cortex/routers/local_llm.py` | Settings UI routes + `/api/models/role` AJAX |
|
|
| `cortex/routers/chat.py` | `_backend_label()`, `fallback_used` flag |
|
|
| `cortex/config.py` | `ROLE_*` env defaults, `DEFINED_ROLES`, `PRIMARY_BACKEND` |
|