# Architecture: LLM Backends > How Cortex talks to AI models. > Last updated: 2026-04-03 --- ## Three Backends | Backend | Used for | Auth | Config | |---|---|---|---| | **Claude CLI** | Primary chat, all user-facing responses | OAuth token from `~/.claude/.credentials.json` | `DEFAULT_MODEL` in `.env` | | **Gemini CLI** | Fallback when Claude unavailable | Gemini CLI credentials | Auto-fallback | | **Local (Open WebUI)** | Private/offline tasks, cost-free use | API key per user in `local_llm.json` | `/settings/local` UI | The **Gemini API** (google-genai SDK) is also used — but only by the orchestrator tool loop, not as a general chat backend. See [`ARCH__FUTURE.md`](ARCH__FUTURE.md) for the orchestrator pattern. --- ## Backend Selection User toggles backend in the UI: `claude → gemini → local` (cycles). The active backend is stored server-side; the UI reflects it with color coding (default / green / amber). When local is active, the active model name appears below the toggle button. **Fallback chain** (automatic, on error): ``` claude → gemini gemini → claude local → claude ``` Auth expiry on Claude triggers a UI banner + `claude_auth_expired` SSE event. --- ## Claude Backend (`_claude()`) Runs `claude --print --no-session-persistence --output-format text` as a subprocess. - System prompt passed via `--system-prompt` - Conversation history formatted as `` block - Token read live from `~/.claude/.credentials.json` on every call — never relies on the env var, which goes stale after `claude auth login` - Model override via `--model` flag (e.g. `claude-opus-4-6`) Timeout: `TIMEOUT_CLAUDE=60` seconds (`.env`) --- ## Gemini CLI Backend (`_gemini()`) Runs `gemini --output-format text --extensions "" -p ` as a subprocess. - `--extensions ""` disables all MCP extensions — prevents child processes from keeping pipes open after responding - `start_new_session=True` puts the process in its own group for clean `os.killpg` on timeout - Output is cleaned to strip CLI noise lines (loading messages, retry notices, quota warnings) Timeout: `TIMEOUT_GEMINI=120` seconds (`.env`) --- ## Local Backend (`_local()`) HTTP POST to Open WebUI's OpenAI-compatible endpoint: `{api_url}/api/chat/completions`. Per-user config in `home/{user}/local_llm.json`: ```json { "hosts": [{"id": "...", "label": "scott_gaming", "api_url": "http://192.168.32.19:3000", "api_key": "sk-..."}], "models": [{"id": "...", "host_id": "...", "label": "Gemma 4 Small", "model_name": "agent-support-gemma-small"}], "active_model_id": "..." } ``` Resolution order for active model: 1. User's `active_model_id` in `local_llm.json` 2. `.env` server defaults (`LOCAL_API_URL` / `LOCAL_MODEL`) 3. Error — user is prompted to configure at `/settings/local` Timeout: `TIMEOUT_LOCAL=300` seconds (`.env`) — local models may need to load from disk. **Manage at:** `/settings/local` — supports multiple hosts and models per user, "Fetch from host" button to populate model list from the server. --- ## Distillation Backends Memory distillation runs on a schedule and uses the LLM for mid and long distill passes. By default uses the primary backend (`claude`). Override in `.env`: ``` DISTILL_BACKEND_MID=local # saves API credits — Gemma handles summarization well DISTILL_BACKEND_LONG= # empty = use primary (claude recommended for quality) ``` --- ## Current Local Models (scott_gaming, 8 GB VRAM) | Model | Alias | Speed | Practical Context | |---|---|---|---| | Gemma 4 E4B | `agent-support-gemma-small` | ~25 t/s | **72k tokens** | | Gemma 4 26B A4B (MoE) | `agent-support-gemma-medium` | ~9 t/s | **50k tokens** | Both support OpenAI `tools` / `tool_choice` function calling — required for the local orchestrator. Full Open WebUI API reference: [`docs/OPEN_WEBUI_API.md`](../docs/OPEN_WEBUI_API.md)