Architecture: LLM Backends
How Cortex selects and talks to AI models. Last updated: 2026-04-06
Backends
| Backend | Type | Auth | Notes |
|---|---|---|---|
| Claude CLI | claude_cli | OAuth token from ~/.claude/.credentials.json | Primary chat; model set via DEFAULT_MODEL in .env |
| Gemini CLI | gemini_cli | Gemini CLI credentials | Fallback / explicit selection |
| Gemini API | gemini_api | GEMINI_API_KEY in .env | Orchestrator tool loop only, not general chat |
| Local (OpenAI-compat) | local_openai | API key per host in model registry | Open WebUI, Ollama, OpenRouter, LiteLLM, etc. |
Backend Selection
Default: Role-Based Routing (Auto)
When no explicit backend is selected, Cortex routes to the model configured for the
request's role in the user's model registry. Roles: chat, orchestrator, distill,
coder, research (extensible via DEFINED_ROLES in .env).
Resolution order for a role:
- User registry: roles[role].primary → backup_1 → backup_2 → backup_3 → backup_4
- .env role default: ROLE_CHAT=claude_cli, ROLE_DISTILL=gemini_api, etc.
- Hardcoded last-resort: chat/distill/coder → claude_cli, orchestrator/research → gemini_api
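
The resolution order above can be sketched as a function that collects candidate model IDs for a role. This is a minimal sketch, not the project's implementation: `resolve_role_candidates`, `SLOTS`, and `HARDCODED_DEFAULTS` are hypothetical names, and treating the order as a deduplicated list of candidates to try is one reading of the rule.

```python
from typing import List, Mapping

# Hardcoded last-resort defaults, as listed in the resolution order.
HARDCODED_DEFAULTS = {
    "chat": "claude_cli",
    "distill": "claude_cli",
    "coder": "claude_cli",
    "orchestrator": "gemini_api",
    "research": "gemini_api",
}

# Registry slots, highest priority first.
SLOTS = ("primary", "backup_1", "backup_2", "backup_3", "backup_4")


def resolve_role_candidates(
    role: str, registry: Mapping, env: Mapping[str, str]
) -> List[str]:
    """Return model IDs to try for `role`, highest priority first (hypothetical helper)."""
    candidates: List[str] = []

    # 1. User registry: primary, then backup_1..backup_4, skipping unset slots.
    role_slots = registry.get("roles", {}).get(role, {})
    for slot in SLOTS:
        model_id = role_slots.get(slot)
        if model_id:
            candidates.append(model_id)

    # 2. .env role default, e.g. ROLE_CHAT=claude_cli; blank falls through.
    env_default = env.get(f"ROLE_{role.upper()}", "").strip()
    if env_default:
        candidates.append(env_default)

    # 3. Hardcoded last resort.
    last_resort = HARDCODED_DEFAULTS.get(role)
    if last_resort:
        candidates.append(last_resort)

    # Drop duplicates while preserving order.
    return list(dict.fromkeys(candidates))
```

With the sample registry from the Schema section and an empty environment, the chat role would yield the registry model first and claude_cli second.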
Explicit Override
The UI backend toggle cycles: auto → claude → gemini → local → auto
- auto (default): role-based routing as above; sends model: null to /chat
- claude / gemini / local: bypasses role routing; forces that specific backend
- When "local" is active, the configured model name appears below the toggle button
Fallback chain (automatic, on any error):
claude → gemini
gemini → claude
local → claude
Each response includes a model label (bottom-right of the message bubble) showing what
actually responded. Amber label with ⚡ = fallback was used.
Auth expiry on Claude triggers a UI banner + claude_auth_expired SSE event.
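
The fallback chain can be pictured as a small lookup table plus a single retry. The sketch below is illustrative only: `complete_with_fallback` and the `backends` mapping are hypothetical, and it assumes exactly one fallback hop, with a failure of the fallback itself propagating to the caller.

```python
from typing import Callable, Dict, Tuple

# Fallback chain from this section: failing backend -> backend tried next.
FALLBACK = {"claude": "gemini", "gemini": "claude", "local": "claude"}


def complete_with_fallback(
    backend: str,
    prompt: str,
    backends: Dict[str, Callable[[str], str]],
) -> Tuple[str, str, bool]:
    """Return (text, backend_used, fallback_used).

    `backends` maps a backend name to a callable that takes a prompt and
    returns the response text (hypothetical interface for this sketch).
    """
    try:
        return backends[backend](prompt), backend, False
    except Exception:
        # Any error triggers the fallback; errors from the fallback propagate.
        fallback = FALLBACK[backend]
        return backends[fallback](prompt), fallback, True
```

The returned `backend_used` and `fallback_used` values correspond to what the model label shows: which backend actually answered, and whether the amber ⚡ variant applies.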
Model Registry
Per-user configuration stored in home/{user}/model_registry.json.
Hosts and models are managed at Settings → Model Registry (/settings/local).
Schema
{
  "version": 1,
  "hosts": [
    {
      "id": "abc123",
      "label": "Home ML Laptop",
      "api_url": "http://192.168.x.x:3000",
      "api_key": "sk-...",
      "host_type": "openwebui"
    }
  ],
  "models": [
    {
      "id": "def456",
      "type": "local_openai",
      "label": "Gemma Medium",
      "model_name": "agent-support-gemma-medium",
      "host_id": "abc123",
      "context_k": 50,
      "tags": ["chat", "fast"]
    }
  ],
  "roles": {
    "chat": {
      "primary": "def456",
      "backup_1": "claude_cli"
    }
  }
}
host_type
Controls which API path layout is used:
| host_type | Chat endpoint | Models endpoint | Use for |
|---|---|---|---|
| openwebui (default) | POST {url}/api/chat/completions | GET {url}/api/models | Open WebUI, Ollama |
| openai | POST {url}/chat/completions | GET {url}/models | OpenRouter, LiteLLM, Anthropic-compat |
For host_type openai, set api_url to the base path ending just before /chat/completions:
- OpenRouter: https://openrouter.ai/api/v1
- LiteLLM proxy: http://host:port
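
As a rough illustration of the path layouts in the table, the helper below derives both endpoint URLs from a host entry. `endpoints_for` and `API_PREFIX` are hypothetical names for this sketch; the only behaviour taken from this document is the two path layouts and openwebui being the default host_type.

```python
from typing import Dict

# Path prefix inserted between api_url and the OpenAI-style route.
API_PREFIX = {"openwebui": "/api", "openai": ""}


def endpoints_for(host: Dict[str, str]) -> Dict[str, str]:
    """Build chat and models URLs for a registry host entry (hypothetical helper)."""
    # A missing or empty host_type is treated as the documented default.
    host_type = host.get("host_type") or "openwebui"
    prefix = API_PREFIX[host_type]
    # Tolerate a trailing slash in api_url so the paths never contain "//".
    base = host["api_url"].rstrip("/")
    return {
        "chat": f"{base}{prefix}/chat/completions",
        "models": f"{base}{prefix}/models",
    }
```

For an OpenRouter host with api_url https://openrouter.ai/api/v1 and host_type openai, the chat URL comes out as https://openrouter.ai/api/v1/chat/completions.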
Built-in model IDs
Always resolvable without a registry entry:
| ID | Backend |
|---|---|
| claude_cli | Claude CLI subprocess |
| gemini_cli | Gemini CLI subprocess |
| gemini_api | Gemini API (SDK), orchestrator only |
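
One way to picture how built-in IDs and registry entries coexist is a lookup that checks the built-ins first and only then searches the registry. This is a sketch under assumptions: `lookup_model` and the shape of its return value are hypothetical, and the registry layout follows the Schema section.

```python
from typing import Any, Dict

# IDs that resolve without a registry entry.
BUILTIN_IDS = {"claude_cli", "gemini_cli", "gemini_api"}


def lookup_model(model_id: str, registry: Dict[str, Any]) -> Dict[str, Any]:
    """Resolve a model ID to a config dict (hypothetical helper and return shape)."""
    if model_id in BUILTIN_IDS:
        # Built-ins need no host; the ID doubles as the backend type.
        return {"id": model_id, "type": model_id}

    for model in registry.get("models", []):
        if model["id"] == model_id:
            # Attach the host entry referenced by host_id, if any.
            host = next(
                (h for h in registry.get("hosts", []) if h["id"] == model.get("host_id")),
                None,
            )
            return {**model, "host": host}

    raise KeyError(f"unknown model id: {model_id}")
```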
Claude Backend (_claude())
Runs claude --print --no-session-persistence --output-format text as a subprocess.
- System prompt passed via --system-prompt
- Conversation history formatted as <conversation> block
- Token read live from ~/.claude/.credentials.json on every call; never relies on the env var, which goes stale after claude auth login
- Model override via --model flag when a specific model_name is configured in the registry
Timeout: TIMEOUT_CLAUDE=60 seconds (.env)
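
The invocation described above might be assembled roughly as follows. The flags are the ones named in this section; `build_claude_argv` and `run_claude` are hypothetical helpers, feeding the prompt on stdin is an assumption, and credential handling is left out entirely.

```python
import subprocess
from typing import List, Optional


def build_claude_argv(system_prompt: str, model_name: Optional[str] = None) -> List[str]:
    """Assemble the claude command line from the flags listed in this section."""
    argv = [
        "claude",
        "--print",
        "--no-session-persistence",
        "--output-format", "text",
        "--system-prompt", system_prompt,
    ]
    # Only add --model when the registry configures a specific model_name.
    if model_name:
        argv += ["--model", model_name]
    return argv


def run_claude(
    prompt: str,
    system_prompt: str,
    model_name: Optional[str] = None,
    timeout: float = 60.0,  # mirrors TIMEOUT_CLAUDE=60
) -> str:
    """Run the CLI and return its stdout.

    Assumption: the prompt (including any <conversation> block) is sent on stdin.
    """
    result = subprocess.run(
        build_claude_argv(system_prompt, model_name),
        input=prompt,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # a non-zero exit raises CalledProcessError
    )
    return result.stdout
```

Keeping argv construction separate from execution makes the flag logic testable without the CLI installed.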
Gemini CLI Backend (_gemini())
Runs gemini --output-format text --extensions "" -p <prompt> as a subprocess.
- --extensions "" disables all MCP extensions; prevents child processes keeping pipes open
- start_new_session=True puts the process in its own group for clean os.killpg on timeout
- Output is cleaned to strip CLI noise lines (loading messages, retry notices, quota warnings)
Timeout: TIMEOUT_GEMINI=120 seconds (.env)
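
The process-group technique mentioned above can be sketched generically. `run_in_own_group` is a hypothetical helper, not the project's `_gemini()`; it shows only the `start_new_session=True` plus `os.killpg` pattern on POSIX systems, and sending SIGKILL is an assumption about how the timeout is enforced.

```python
import os
import signal
import subprocess
from typing import List


def run_in_own_group(argv: List[str], timeout: float) -> str:
    """Run argv in its own session; on timeout, kill the whole process group."""
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        start_new_session=True,  # child becomes leader of a new session and group
    )
    try:
        stdout, _stderr = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            # As group leader, the child's process group ID equals its PID,
            # so this also reaches any grandchildren it spawned.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group already exited
        proc.communicate()  # reap the child and close the pipes
        raise
    return stdout
```

Under these assumptions, the Gemini call would pass the command line from this section as `argv` with a timeout of 120 seconds. Killing the group rather than the single process matters when the CLI spawns helpers that would otherwise keep the pipes open.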
Local Backend (_local())
HTTP POST to an OpenAI-compatible endpoint. Model config is resolved via the model registry.
# host_type "openwebui": POST {api_url}/api/chat/completions
# host_type "openai": POST {api_url}/chat/completions
Timeout: TIMEOUT_LOCAL=300 seconds (.env) — local models may need to load from disk.
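
A request to such an endpoint could look like the sketch below, using only the standard library. `build_chat_request` and `send_chat_request` are hypothetical; the Bearer authorization header and the `choices[0].message.content` response path follow the common OpenAI chat-completions convention and are assumptions about what a given host accepts and returns.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(
    chat_url: str,
    api_key: str,
    model_name: str,
    messages: List[Dict[str, str]],
) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat endpoint."""
    body = json.dumps({"model": model_name, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        chat_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the host expects the registry api_key as a Bearer token.
            "Authorization": f"Bearer {api_key}",
        },
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 300.0) -> str:
    """Send the request and return the assistant text (timeout mirrors TIMEOUT_LOCAL)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

`chat_url` is whichever of the two path layouts applies to the host's host_type.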
Distillation
Memory distillation uses role="distill" for mid and long passes. Configure the distill
model via the Model Registry → Role Assignments → Distill role.
.env override: ROLE_DISTILL=claude_cli (default). Set to any built-in ID or leave blank
to fall through to the hardcoded default (claude_cli).
Code locations
| File | Responsibility |
|---|---|
| cortex/llm_client.py | complete(): routing, dispatch, fallback |
| cortex/model_registry.py | Per-user registry CRUD and resolution |
| cortex/routers/local_llm.py | Settings UI routes + /api/models/role AJAX |
| cortex/routers/chat.py | _backend_label(), fallback_used flag |
| cortex/config.py | ROLE_* env defaults, DEFINED_ROLES, PRIMARY_BACKEND |