Cortex-Inara/documentation/ARCH__BACKENDS.md

# Architecture: LLM Backends

> How Cortex selects and talks to AI models.
> Last updated: 2026-04-06

---

## Backends

| Backend | Type | Auth | Notes |
|---|---|---|---|
| **Claude CLI** | `claude_cli` | OAuth token from `~/.claude/.credentials.json` | Primary chat; model set via `DEFAULT_MODEL` in `.env` |
| **Gemini CLI** | `gemini_cli` | Gemini CLI credentials | Fallback / explicit selection |
| **Gemini API** | `gemini_api` | `GEMINI_API_KEY` in `.env` | Orchestrator tool loop only — not general chat |
| **Local (OpenAI-compat)** | `local_openai` | API key per host in model registry | Open WebUI, Ollama, OpenRouter, LiteLLM, etc. |

---

## Backend Selection

### Default: Role-Based Routing (Auto)

When no explicit backend is selected, Cortex routes to the model configured for the
request's **role** in the user's model registry. Roles: `chat`, `orchestrator`, `distill`,
`coder`, `research` (extensible via `DEFINED_ROLES` in `.env`).

Resolution order for a role:
1. User registry: `roles[role].primary → backup_1 → backup_2 → backup_3 → backup_4`
2. `.env` role default: `ROLE_CHAT=claude_cli`, `ROLE_DISTILL=gemini_api`, etc.
3. Hardcoded last-resort: `chat/distill/coder → claude_cli`, `orchestrator/research → gemini_api`

### Explicit Override

The UI backend toggle cycles: **auto → claude → gemini → local → auto**

- **auto** (default): role-based routing as above; sends `model: null` to `/chat`
- **claude / gemini / local**: bypasses role routing; forces that specific backend
- When "local" is active, the configured model name appears below the toggle button

**Fallback chain** (automatic, on any error):
```
claude  → gemini
gemini  → claude
local   → claude
```

Each response includes a model label (bottom-right of the message bubble) showing what
actually responded. Amber label with `⚡` = fallback was used.

Auth expiry on Claude triggers a UI banner + `claude_auth_expired` SSE event.

---

## Model Registry

Per-user configuration stored in `home/{user}/model_registry.json`.

Hosts and models are managed at **Settings → Model Registry** (`/settings/local`).

### Schema

```json
{
  "version": 1,
  "hosts": [
    {
      "id": "abc123",
      "label": "Home ML Laptop",
      "api_url": "http://192.168.x.x:3000",
      "api_key": "sk-...",
      "host_type": "openwebui"
    }
  ],
  "models": [
    {
      "id": "def456",
      "type": "local_openai",
      "label": "Gemma Medium",
      "model_name": "agent-support-gemma-medium",
      "host_id": "abc123",
      "context_k": 50,
      "tags": ["chat", "fast"]
    }
  ],
  "roles": {
    "chat": {
      "primary": "def456",
      "backup_1": "claude_cli"
    }
  }
}
```

### host_type

Controls which API path layout is used:

| `host_type` | Chat endpoint | Models endpoint | Use for |
|---|---|---|---|
| `openwebui` (default) | `POST {url}/api/chat/completions` | `GET {url}/api/models` | Open WebUI, Ollama |
| `openai` | `POST {url}/chat/completions` | `GET {url}/models` | OpenRouter, LiteLLM, Anthropic-compat |

Set `api_url` to the base path ending just before `/chat/completions`:
- OpenRouter: `https://openrouter.ai/api/v1`
- LiteLLM proxy: `http://host:port`

### Built-in model IDs

Always resolvable without a registry entry:

| ID | Backend |
|---|---|
| `claude_cli` | Claude CLI subprocess |
| `gemini_cli` | Gemini CLI subprocess |
| `gemini_api` | Gemini API (SDK) — orchestrator only |

---

## Claude Backend (`_claude()`)

Runs `claude --print --no-session-persistence --output-format text` as a subprocess.

- System prompt passed via `--system-prompt`
- Conversation history formatted as `<conversation>` block
- Token read live from `~/.claude/.credentials.json` on every call — never relies on the
  env var, which goes stale after `claude auth login`
- Model override via `--model` flag when a specific `model_name` is configured in the registry

Timeout: `TIMEOUT_CLAUDE=60` seconds (`.env`)

---

## Gemini CLI Backend (`_gemini()`)

Runs `gemini --output-format text --extensions "" -p <prompt>` as a subprocess.

- `--extensions ""` disables all MCP extensions — prevents child processes keeping pipes open
- `start_new_session=True` puts the process in its own group for clean `os.killpg` on timeout
- Output is cleaned to strip CLI noise lines (loading messages, retry notices, quota warnings)

Timeout: `TIMEOUT_GEMINI=120` seconds (`.env`)

---

## Local Backend (`_local()`)

HTTP POST to an OpenAI-compatible endpoint. Model config is resolved via the model registry.

```python
# host_type "openwebui":  POST {api_url}/api/chat/completions
# host_type "openai":     POST {api_url}/chat/completions
```

Timeout: `TIMEOUT_LOCAL=300` seconds (`.env`) — local models may need to load from disk.

---

## Distillation

Memory distillation uses `role="distill"` for mid and long passes. Configure the distill
model via the Model Registry → Role Assignments → Distill role.

`.env` override: `ROLE_DISTILL=claude_cli` (default). Set to any built-in ID or leave blank
to fall through to the hardcoded default (`claude_cli`).

---

## Code locations

| File | Responsibility |
|---|---|
| `cortex/llm_client.py` | `complete()` — routing, dispatch, fallback |
| `cortex/model_registry.py` | Per-user registry CRUD and resolution |
| `cortex/routers/local_llm.py` | Settings UI routes + `/api/models/role` AJAX |
| `cortex/routers/chat.py` | `_backend_label()`, `fallback_used` flag |
| `cortex/config.py` | `ROLE_*` env defaults, `DEFINED_ROLES`, `PRIMARY_BACKEND` |