Local LLM:
- user_settings.py: per-user hosts/models config (local_llm.json)
- routers/local_llm.py + static/local_llm.html: dedicated settings page
- llm_client.py: local OpenAI-compatible backend via httpx
- config.py: LOCAL_API_URL/KEY/MODEL + per-backend timeouts
- Active model shown near backend toggle (amber hint text)
Memory distillation:
- memory_distiller.py: DISTILL_BACKEND_MID/LONG .env overrides
- scheduler.py + notification.py: notify NC Talk after mid/long distill
- notification.py: outbound channel abstraction (NC Talk, extensible)
Session search:
- routers/files.py: GET /sessions/search?q= with excerpts grouped by date
- static/index.html + app.js: search UI in file sidebar with highlight
- _esc() helper to prevent XSS in search results
Proactive cron:
- cron_runner.py: new job types — message (send directly) and brief (LLM + send)
- Both support optional per-job channel override
Channels:
- routers/nextcloud_talk.py: consolidated using notification._send_nct_message()
- routers/auth.py: local backend status in /auth/status
- routers/chat.py: /backend returns {primary, fallback, local_model} object
UI / UX:
- Copy button for user messages (matching assistant)
- Autocomplete disabled on sensitive form fields
- settings.html: local model section replaced with link to /settings/local
Docs overhaul:
- MASTER.md hub + ARCH__SYSTEM/BACKENDS/PERSONA/CHANNELS/FUTURE.md
- ARCH__Intelligence_Layer.md replaced with redirect table
- CORTEX.md trimmed to vision only; README updated
- OPEN_WEBUI_API.md added to docs/
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Architecture: LLM Backends

> How Cortex talks to AI models.
> Last updated: 2026-04-03

---

## Three Backends

| Backend | Used for | Auth | Config |
|---|---|---|---|
| **Claude CLI** | Primary chat, all user-facing responses | OAuth token from `~/.claude/.credentials.json` | `DEFAULT_MODEL` in `.env` |
| **Gemini CLI** | Fallback when Claude is unavailable | Gemini CLI credentials | Auto-fallback |
| **Local (Open WebUI)** | Private/offline tasks, cost-free use | API key per user in `local_llm.json` | `/settings/local` UI |

The **Gemini API** (google-genai SDK) is also used, but only by the orchestrator tool loop, not as a general chat backend. See [`ARCH__FUTURE.md`](ARCH__FUTURE.md) for the orchestrator pattern.

---

## Backend Selection

The user toggles the backend in the UI: `claude → gemini → local` (cycles). The active backend is stored server-side; the UI reflects it with color coding (default / green / amber).

When local is active, the active model name appears below the toggle button.

**Fallback chain** (automatic, on error):
```
claude → gemini
gemini → claude
local → claude
```
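
The toggle cycle and fallback chain above amount to two small lookups. The following is a minimal sketch, not the Cortex implementation: the names `next_backend` and `call_with_fallback` are hypothetical, and trying the fallback only once per request is an assumption.

```python
from typing import Callable

# Order the UI toggle cycles through, as described above.
TOGGLE_ORDER = ["claude", "gemini", "local"]

# Automatic fallback on error, copied from the chain above.
FALLBACK = {"claude": "gemini", "gemini": "claude", "local": "claude"}


def next_backend(current: str) -> str:
    """Return the backend the toggle switches to next (wraps around)."""
    i = TOGGLE_ORDER.index(current)
    return TOGGLE_ORDER[(i + 1) % len(TOGGLE_ORDER)]


def call_with_fallback(
    backend: str,
    prompt: str,
    handlers: dict[str, Callable[[str], str]],
) -> tuple[str, str]:
    """Try `backend`; on any error, try its fallback exactly once.

    Returns (backend_that_answered, reply). The single hop is an assumption;
    it also keeps claude <-> gemini from bouncing back and forth forever.
    """
    try:
        return backend, handlers[backend](prompt)
    except Exception:
        fallback = FALLBACK[backend]
        return fallback, handlers[fallback](prompt)
```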

Auth expiry on Claude triggers a UI banner and a `claude_auth_expired` SSE event.

---

## Claude Backend (`_claude()`)

Runs `claude --print --no-session-persistence --output-format text` as a subprocess.

- System prompt passed via `--system-prompt`
- Conversation history formatted as a `<conversation>` block
- Token read live from `~/.claude/.credentials.json` on every call; it never relies on the env var, which goes stale after `claude auth login`
- Model override via `--model` flag (e.g. `claude-opus-4-6`)

Timeout: `TIMEOUT_CLAUDE=60` seconds (`.env`)

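
The command line described above can be assembled as a plain argument list. This is a rough sketch using only the flags named in this section; `build_claude_argv` and `format_history` are hypothetical helper names, and the `role: text` layout inside the `<conversation>` block is a guess, since only the wrapper tag is documented here.

```python
from typing import Optional


def build_claude_argv(system_prompt: str, model: Optional[str] = None) -> list[str]:
    """Assemble the CLI invocation from the flags listed above."""
    argv = [
        "claude",
        "--print",
        "--no-session-persistence",
        "--output-format", "text",
        "--system-prompt", system_prompt,
    ]
    if model:
        # Optional override, e.g. "claude-opus-4-6".
        argv += ["--model", model]
    return argv


def format_history(turns: list[tuple[str, str]]) -> str:
    """Wrap prior turns in a <conversation> block.

    The per-line "role: text" layout is an assumption; only the wrapper
    tag comes from the description above.
    """
    lines = [f"{role}: {text}" for role, text in turns]
    return "<conversation>\n" + "\n".join(lines) + "\n</conversation>"


# One possible way to run it (how Cortex passes the prompt is not stated here):
#   subprocess.run(build_claude_argv(system_prompt), input=prompt_text,
#                  capture_output=True, text=True, timeout=60)
```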

---

## Gemini CLI Backend (`_gemini()`)

Runs `gemini --output-format text --extensions "" -p <prompt>` as a subprocess.

- `--extensions ""` disables all MCP extensions, which prevents child processes from keeping pipes open after responding
- `start_new_session=True` puts the process in its own process group so `os.killpg` can clean it up on timeout
- Output is cleaned to strip CLI noise lines (loading messages, retry notices, quota warnings)

Timeout: `TIMEOUT_GEMINI=120` seconds (`.env`)

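
The `start_new_session` / `os.killpg` pattern mentioned above can be sketched roughly as follows. This is a POSIX-only illustration, not the code in `llm_client.py`; the function name `run_in_own_group` is hypothetical, and using `SIGKILL` rather than a gentler signal is an assumption.

```python
import os
import signal
import subprocess
import sys


def run_in_own_group(argv: list[str], timeout: float) -> str:
    """Run argv in its own session and kill the whole group on timeout."""
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        start_new_session=True,  # child leads a new session, so its pgid == its pid
    )
    try:
        stdout, _stderr = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            # Kill the child and anything it spawned in the same group.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group already exited between the timeout and the kill
        proc.communicate()  # reap the process and close the pipes
        raise
    return stdout


if __name__ == "__main__":
    # Stand-in command; the real call would pass the gemini argv shown above.
    print(run_in_own_group([sys.executable, "-c", "print('ok')"], timeout=10))
```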

---

## Local Backend (`_local()`)

HTTP POST to Open WebUI's OpenAI-compatible endpoint: `{api_url}/api/chat/completions`.

Per-user config in `home/{user}/local_llm.json`:
```json
{
  "hosts": [{"id": "...", "label": "scott_gaming", "api_url": "http://192.168.32.19:3000", "api_key": "sk-..."}],
  "models": [{"id": "...", "host_id": "...", "label": "Gemma 4 Small", "model_name": "agent-support-gemma-small"}],
  "active_model_id": "..."
}
```

Resolution order for active model:
1. User's `active_model_id` in `local_llm.json`
2. `.env` server defaults (`LOCAL_API_URL` / `LOCAL_MODEL`)
3. Error: the user is prompted to configure at `/settings/local`

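
The resolution order above can be expressed as one function over the parsed `local_llm.json` and the environment. This is a sketch under assumptions: `resolve_local_target` is a hypothetical name, the returned dict shape is invented for illustration, and raising `LookupError` stands in for however Cortex actually surfaces the "not configured" case.

```python
def resolve_local_target(user_cfg: dict, env: dict) -> dict:
    """Return {"api_url", "api_key", "model_name"} for the active local model."""
    # 1. The user's active_model_id, if it points at a known model and host.
    models = {m["id"]: m for m in user_cfg.get("models", [])}
    hosts = {h["id"]: h for h in user_cfg.get("hosts", [])}
    model = models.get(user_cfg.get("active_model_id"))
    host = hosts.get(model["host_id"]) if model else None
    if model and host:
        return {
            "api_url": host["api_url"],
            "api_key": host.get("api_key", ""),
            "model_name": model["model_name"],
        }

    # 2. Server-wide defaults from .env.
    if env.get("LOCAL_API_URL") and env.get("LOCAL_MODEL"):
        return {
            "api_url": env["LOCAL_API_URL"],
            "api_key": env.get("LOCAL_API_KEY", ""),
            "model_name": env["LOCAL_MODEL"],
        }

    # 3. Nothing usable: the caller should point the user at /settings/local.
    raise LookupError("No local model configured; set one up at /settings/local")
```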

Timeout: `TIMEOUT_LOCAL=300` seconds (`.env`), because local models may need to load from disk.

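
Putting the endpoint and timeout together, a request to the local backend might be built as shown below. This is an illustrative sketch: `build_chat_request` is a hypothetical helper, the `Authorization: Bearer` header and the `model` / `messages` payload follow the usual OpenAI-compatible convention and are assumed rather than taken from the Cortex source, and the block only constructs the request without sending it.

```python
def build_chat_request(
    api_url: str,
    api_key: str,
    model_name: str,
    messages: list[dict],
    timeout: float = 300.0,  # mirrors TIMEOUT_LOCAL
) -> dict:
    """Return the pieces of the POST without sending it."""
    return {
        "url": api_url.rstrip("/") + "/api/chat/completions",
        "headers": {"Authorization": f"Bearer {api_key}"},
        "json": {"model": model_name, "messages": messages},
        "timeout": timeout,
    }


# Sending it with httpx would look roughly like:
#   req = build_chat_request(api_url, api_key, model_name, messages)
#   resp = httpx.post(req["url"], headers=req["headers"],
#                     json=req["json"], timeout=req["timeout"])
#   reply = resp.json()["choices"][0]["message"]["content"]
```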

**Manage at:** `/settings/local`. Supports multiple hosts and models per user, with a "Fetch from host" button that populates the model list from the server.

---

## Distillation Backends

Memory distillation runs on a schedule and uses the LLM for the mid and long distill passes. By default it uses the primary backend (`claude`). Override in `.env`:

```
DISTILL_BACKEND_MID=local     # saves API credits; Gemma handles summarization well
DISTILL_BACKEND_LONG=         # empty = use primary (claude recommended for quality)
```
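
The "empty means primary" rule for these overrides can be sketched as a small lookup. The function name `distill_backend` and the `PRIMARY_BACKEND` constant are hypothetical; only the two variable names and the default of `claude` come from the text above.

```python
import os
from typing import Mapping

PRIMARY_BACKEND = "claude"  # default per the description above


def distill_backend(pass_name: str, env: Mapping[str, str] = os.environ) -> str:
    """Pick the backend for the "mid" or "long" distill pass."""
    if pass_name not in ("mid", "long"):
        raise ValueError(f"unknown distill pass: {pass_name!r}")
    override = env.get(f"DISTILL_BACKEND_{pass_name.upper()}", "").strip()
    # Unset or empty both fall through to the primary backend.
    return override or PRIMARY_BACKEND


if __name__ == "__main__":
    example_env = {"DISTILL_BACKEND_MID": "local", "DISTILL_BACKEND_LONG": ""}
    print(distill_backend("mid", example_env), distill_backend("long", example_env))
```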

---

## Current Local Models (scott_gaming, 8 GB VRAM)

| Model | Alias | Speed | Practical Context |
|---|---|---|---|
| Gemma 4 E4B | `agent-support-gemma-small` | ~25 t/s | **72k tokens** |
| Gemma 4 26B A4B (MoE) | `agent-support-gemma-medium` | ~9 t/s | **50k tokens** |

Both support OpenAI `tools` / `tool_choice` function calling, which is required for the local orchestrator.

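
For reference, an OpenAI-style function-calling request and the matching response parsing look roughly like this. The tool `search_sessions` and both helper names are made up for illustration and are not part of the Cortex orchestrator; the payload and response shapes follow the OpenAI chat completions convention.

```python
import json


def build_tool_request(model_name: str, messages: list[dict]) -> dict:
    """Chat payload that offers the model one callable tool."""
    return {
        "model": model_name,
        "messages": messages,
        "tools": [{
            "type": "function",
            "function": {
                "name": "search_sessions",  # hypothetical tool
                "description": "Search past sessions for a query string.",
                "parameters": {
                    "type": "object",
                    "properties": {"q": {"type": "string"}},
                    "required": ["q"],
                },
            },
        }],
        "tool_choice": "auto",  # let the model decide whether to call it
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Pull (name, arguments) pairs out of an OpenAI-style response."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    # Arguments arrive as a JSON-encoded string, so decode them.
    return [
        (c["function"]["name"], json.loads(c["function"]["arguments"]))
        for c in calls
    ]
```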

Full Open WebUI API reference: [`docs/OPEN_WEBUI_API.md`](../docs/OPEN_WEBUI_API.md)