feat: local LLM multi-model, session search, cron proactive types, notifications, docs overhaul

Local LLM: - user_settings.py: per-user hosts/models config (local_llm.json) - routers/local_llm.py + static/local_llm.html: dedicated settings page - llm_client.py: local OpenAI-compatible backend via httpx - config.py: LOCAL_API_URL/KEY/MODEL + per-backend timeouts - Active model shown near backend toggle (amber hint text) Memory distillation: - memory_distiller.py: DISTILL_BACKEND_MID/LONG .env overrides - scheduler.py + notification.py: notify NC Talk after mid/long distill - notification.py: outbound channel abstraction (NC Talk, extensible) Session search: - routers/files.py: GET /sessions/search?q= with excerpts grouped by date - static/index.html + app.js: search UI in file sidebar with highlight - _esc() helper to prevent XSS in search results Proactive cron: - cron_runner.py: new job types — message (send directly) and brief (LLM + send) - Both support optional per-job channel override Channels: - routers/nextcloud_talk.py: consolidated using notification._send_nct_message() - routers/auth.py: local backend status in /auth/status - routers/chat.py: /backend returns {primary, fallback, local_model} object UI / UX: - Copy button for user messages (matching assistant) - Autocomplete disabled on sensitive form fields - settings.html: local model section replaced with link to /settings/local Docs overhaul: - MASTER.md hub + ARCH__SYSTEM/BACKENDS/PERSONA/CHANNELS/FUTURE.md - ARCH__Intelligence_Layer.md replaced with redirect table - CORTEX.md trimmed to vision only; README updated - OPEN_WEBUI_API.md added to docs/ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 20:53:06 -04:00
parent bd6532e93a
commit a4daebdc9b
33 changed files with 2985 additions and 486 deletions
--- a/documentation/ARCH__BACKENDS.md
+++ b/documentation/ARCH__BACKENDS.md
@@ -0,0 +1,106 @@
+# Architecture: LLM Backends
+
+> How Cortex talks to AI models.
+> Last updated: 2026-04-03
+
+---
+
+## Three Backends
+
+| Backend | Used for | Auth | Config |
+|---|---|---|---|
+| **Claude CLI** | Primary chat, all user-facing responses | OAuth token from `~/.claude/.credentials.json` | `DEFAULT_MODEL` in `.env` |
+| **Gemini CLI** | Fallback when Claude unavailable | Gemini CLI credentials | Auto-fallback |
+| **Local (Open WebUI)** | Private/offline tasks, cost-free use | API key per user in `local_llm.json` | `/settings/local` UI |
+
+The **Gemini API** (google-genai SDK) is also used — but only by the orchestrator tool loop, not as a general chat backend. See [`ARCH__FUTURE.md`](ARCH__FUTURE.md) for the orchestrator pattern.
+
+---
+
+## Backend Selection
+
+User toggles backend in the UI: `claude → gemini → local` (cycles). The active backend is stored server-side; the UI reflects it with color coding (default / green / amber).
+
+When local is active, the active model name appears below the toggle button.
+
+**Fallback chain** (automatic, on error):
+```
+claude  → gemini
+gemini  → claude
+local   → claude
+```
+
+Auth expiry on Claude triggers a UI banner + `claude_auth_expired` SSE event.
+
+---
+
+## Claude Backend (`_claude()`)
+
+Runs `claude --print --no-session-persistence --output-format text` as a subprocess.
+
+- System prompt passed via `--system-prompt`
+- Conversation history formatted as `<conversation>` block
+- Token read live from `~/.claude/.credentials.json` on every call — never relies on the env var, which goes stale after `claude auth login`
+- Model override via `--model` flag (e.g. `claude-opus-4-6`)
+
+Timeout: `TIMEOUT_CLAUDE=60` seconds (`.env`)
+
+---
+
+## Gemini CLI Backend (`_gemini()`)
+
+Runs `gemini --output-format text --extensions "" -p <prompt>` as a subprocess.
+
+- `--extensions ""` disables all MCP extensions — prevents child processes from keeping pipes open after responding
+- `start_new_session=True` puts the process in its own group for clean `os.killpg` on timeout
+- Output is cleaned to strip CLI noise lines (loading messages, retry notices, quota warnings)
+
+Timeout: `TIMEOUT_GEMINI=120` seconds (`.env`)
+
+---
+
+## Local Backend (`_local()`)
+
+HTTP POST to Open WebUI's OpenAI-compatible endpoint: `{api_url}/api/chat/completions`.
+
+Per-user config in `home/{user}/local_llm.json`:
+```json
+{
+  "hosts": [{"id": "...", "label": "scott_gaming", "api_url": "http://192.168.32.19:3000", "api_key": "sk-..."}],
+  "models": [{"id": "...", "host_id": "...", "label": "Gemma 4 Small", "model_name": "agent-support-gemma-small"}],
+  "active_model_id": "..."
+}
+```
+
+Resolution order for active model:
+1. User's `active_model_id` in `local_llm.json`
+2. `.env` server defaults (`LOCAL_API_URL` / `LOCAL_MODEL`)
+3. Error — user is prompted to configure at `/settings/local`
+
+Timeout: `TIMEOUT_LOCAL=300` seconds (`.env`) — local models may need to load from disk.
+
+**Manage at:** `/settings/local` — supports multiple hosts and models per user, "Fetch from host" button to populate model list from the server.
+
+---
+
+## Distillation Backends
+
+Memory distillation runs on a schedule and uses the LLM for mid and long distill passes. By default uses the primary backend (`claude`). Override in `.env`:
+
+```
+DISTILL_BACKEND_MID=local   # saves API credits — Gemma handles summarization well
+DISTILL_BACKEND_LONG=       # empty = use primary (claude recommended for quality)
+```
+
+---
+
+## Current Local Models (scott_gaming, 8 GB VRAM)
+
+| Model | Alias | Speed | Practical Context |
+|---|---|---|---|
+| Gemma 4 E4B | `agent-support-gemma-small` | ~25 t/s | **72k tokens** |
+| Gemma 4 26B A4B (MoE) | `agent-support-gemma-medium` | ~9 t/s | **50k tokens** |
+
+Both support OpenAI `tools` / `tool_choice` function calling — required for the local orchestrator.
+
+Full Open WebUI API reference: [`docs/OPEN_WEBUI_API.md`](../docs/OPEN_WEBUI_API.md)
--- a/documentation/ARCH__CHANNELS.md
+++ b/documentation/ARCH__CHANNELS.md
@@ -0,0 +1,149 @@
+# Architecture: Input Channels
+
+> How messages reach Cortex and how Cortex reaches back.
+> Last updated: 2026-04-03
+
+---
+
+## Channel Summary
+
+| Channel | Direction | Auth | Endpoint |
+|---|---|---|---|
+| Web UI | In + Out | JWT session cookie | `/{user}/{persona}` |
+| Nextcloud Talk | In + Out | HMAC-SHA256 | `POST /webhook/nextcloud/{username}` |
+| Google Chat | In + Out | JWT (Google system token) | `POST /channels/google-chat/{username}` |
+| Cron | Out (proactive) | Internal | APScheduler |
+| Webhooks | In (future) | TBD | `POST /webhook/{source}` |
+
+**Per-user config:** Each channel that needs secrets (NC Talk bot key, Google Chat audience) stores them in `home/{username}/channels.json`. No channel access by default — each user sets up their own.
+
+---
+
+## Web UI
+
+Single-page app served from `cortex/static/`. All chat happens via `POST /chat` (streaming SSE for real-time response) or `POST /orchestrate` (async job, polled).
+
+**Session auth:** Login form (`/login`) → bcrypt password check → JWT cookie (30-day expiry). Google OAuth also available (`/auth/google`). All non-public routes require a valid cookie.
+
+**Modes:**
+- **Direct** — message goes straight to LLM via `/chat`
+- **Agent** — message goes to orchestrator (`/orchestrate`), tool loop runs, result polled and streamed into UI
+
+**Context + Memory panel:** Shows current backend (claude/gemini/local), memory tier, active local model. Toggle backend cycles claude → gemini → local.
+
+**Files panel:** Browse and edit persona markdown files in-browser. Session search at the bottom.
+
+**Settings:** `/settings` — Gemini API key, Google account, connected status. `/settings/local` — local model hosts and models.
+
+---
+
+## Nextcloud Talk
+
+Bot integration. The bot is registered in a Talk room; it receives messages, generates a response, and sends it back via the NC Talk bot API.
+
+**Incoming:** `POST /webhook/nextcloud/{username}`
+- Signature verified: `HMAC-SHA256(secret, random + raw_body)`
+- Ignores non-Create events and non-Note types
+- Strips `@{persona}` mention prefix from message text
+- Processes in background task (immediate 200 response to NC Talk)
+
+**Outgoing:** Bot API `POST /ocs/v2.php/apps/spreed/api/v1/bot/{room}/message`
+- Signature: `HMAC-SHA256(secret, random + message_text)` — note: message text, not body
+- Logic lives in `notification.py` (`_send_nct_message`) — shared with proactive notifications
+
+**Proactive notifications:** Set `notification_room` in `channels.json` → `nextcloud`. Used by distill completion alerts and `message`/`brief` cron jobs.
+
+**Per-user config (`channels.json`):**
+```json
+{
+  "nextcloud": {
+    "persona": "inara",
+    "url": "https://cloud.dgrzone.com",
+    "bot_secret": "...",
+    "notification_room": "<room-token>",
+    "timeout": 55
+  }
+}
+```
+
+Full setup guide: [`docs/NEXTCLOUD_TALK_BOT.md`](../docs/NEXTCLOUD_TALK_BOT.md)
+
+---
+
+## Google Chat
+
+Workspace Add-on. Messages arrive as HTTP POST from Google's infrastructure; the handler returns a JSON response synchronously (no background task — Google expects an immediate reply).
+
+**Incoming:** `POST /channels/google-chat/{username}`
+- Auth: JWT in `authorizationEventObject.systemIdToken`, verified against Google's JWKS
+- Response format: `hostAppDataAction.chatDataAction.createMessageAction`
+
+**Per-user config (`channels.json`):**
+```json
+{
+  "google_chat": {
+    "persona": "inara",
+    "audience": "https://cortex.dgrzone.com/channels/google-chat/scott",
+    "backend": "claude",
+    "timeout": 25
+  }
+}
+```
+
+Full setup guide: [`docs/GOOGLE_CHAT_BOT.md`](../docs/GOOGLE_CHAT_BOT.md)
+
+---
+
+## Cron / Proactive Messages
+
+User-defined scheduled jobs stored in `home/{user}/persona/{name}/CRONS.json`. Registered at startup by `scheduler.py`; manageable via the `cron_*` orchestrator tools.
+
+**Job types:**
+
+| Type | What happens |
+|---|---|
+| `remind` | Appends to `REMINDERS.md` — surfaced in context at tier 2+ |
+| `note` | Appends to `SCRATCH.md` — read on demand |
+| `message` | Sends payload text to user's notification channel |
+| `brief` | Runs LLM with payload as prompt, sends response to notification channel |
+
+**`brief` example — morning briefing:**
+```json
+{
+  "label": "Morning briefing",
+  "schedule": "daily:08:00",
+  "type": "brief",
+  "payload": "Give Scott a brief good morning. Note any pending reminders or tasks due today.",
+  "enabled": true
+}
+```
+
+**Channel selection for `message`/`brief`:**
+1. `channel` field on the job (if set)
+2. `notification_channel` key in `channels.json`
+3. Auto-detect: uses `nextcloud` if configured
+
+**Schedule formats:** `hourly` | `daily` | `daily:HH:MM` | `weekly:DOW` | `weekly:DOW:HH:MM`
+
+---
+
+## Notification Channel Config
+
+`notification_channel` in `channels.json` sets the default outbound channel for all proactive messages (distill alerts, cron message/brief jobs):
+
+```json
+{
+  "notification_channel": "nextcloud",
+  ...
+}
+```
+
+If absent, defaults to `nextcloud` if configured. Currently only NC Talk is supported for outbound; Google Chat outbound is a future item.
+
+---
+
+## Future Channels
+
+- **WhatsApp** — Business API or bridge (not started; needs account)
+- **Gitea webhooks** — push/PR/issue events → orchestrator (router pattern exists; add `gitea.py`)
+- **Aether platform events** — trigger agent actions from business data changes
--- a/documentation/ARCH__FUTURE.md
+++ b/documentation/ARCH__FUTURE.md
@@ -0,0 +1,192 @@
+# Architecture: Planned Features
+
+> What's next and how it's designed to work.
+> Last updated: 2026-04-04
+
+For the current task list see `TODO__Agents.md`. For phases and priorities see `ROADMAP.md`.
+
+---
+
+## 1. Local Orchestrator
+
+**Status:** High priority — design complete, not yet built.
+
+Same ReAct tool loop as the Gemini API orchestrator, but driven by a local model via Open WebUI's OpenAI-compatible API. Enables offline/private agent tasks with no API cost.
+
+**Why local models work for this now:** Gemma 4 E4B and 26B A4B both support OpenAI `tools` / `tool_choice` function calling. The tool schema is nearly identical to Gemini's `FunctionDeclaration` — minor field renaming only.
+
+**Design:**
+```
+POST /orchestrate  (mode: "local")
+    ↓
+local_orchestrator_engine.py
+    • converts tools/ to OpenAI tools format
+    • POST /api/chat/completions with tools array
+    • parse tool_calls response
+    • execute tool, append result
+    • loop until finish_reason: "stop"
+    ↓
+response returned (local model generates final answer)
+```
+
+Model selection:
+- **Gemma 4 E4B** (25 t/s, 72k ctx) — interactive/fast tasks
+- **Gemma 4 26B A4B** (9 t/s, 50k ctx) — heavier reasoning, background tasks
+
+Context budget per iteration (system prompt + memory + tool results + history):
+- Small model: budget ~40-50k tokens per round
+- Medium model: budget ~35-40k tokens per round
+
+Full API reference: [`docs/OPEN_WEBUI_API.md`](../docs/OPEN_WEBUI_API.md)
+
+---
+
+## 2. Dev Agent Pipeline
+
+**Status:** Design complete, not yet built.
+
+Accept a plain-English task, implement code changes, verify them, and present for human approval before committing.
+
+```
+Task (chat / Gitea issue / Kanban)
+    ↓
+Orchestrator — reads relevant files, routes to specialist
+    ↓
+Specialist Agent (Claude CLI in project directory)
+    • implements the change
+    • runs self-check: py_compile / svelte-check
+    ↓
+Supervisor Agent
+    • reviews the diff
+    • runs test suite
+    • returns: PASS / NEEDS_REVIEW / FAIL + reason
+    ↓
+Human approval gate
+    • summary in Cortex UI or NC Talk
+    • approve → commit (+ optional push)
+    • reject <20><> feedback back to specialist
+```
+
+**Specialists** (both Claude CLI):
+- **Frontend** — working dir: `~/OSIT_dev/aether_app_sveltekit/` — runs `svelte-check` after every change
+- **Backend** — working dir: `~/OSIT_dev/aether_api_fastapi/` — runs `py_compile` + unit tests
+
+**Supervisor** returns structured JSON:
+```json
+{
+  "verdict": "PASS | NEEDS_REVIEW | FAIL",
+  "checks_passed": ["py_compile"],
+  "checks_failed": [],
+  "review_notes": "...",
+  "commit_message": "..."
+}
+```
+
+---
+
+## 3. Gitea Integration
+
+**Status:** Not started. pfSense port forward for SSH already confirmed working.
+
+- **Webhooks → Cortex:** push/PR/issue events → `POST /webhook/gitea` → orchestrator
+  - Router pattern already established; add `cortex/routers/gitea.py`
+- **Gitea Actions CI:** `.gitea/workflows/check.yml` — run `py_compile`/`svelte-check` on push
+- **Cortex → Gitea:** after human approval, call Gitea API to create PR or push branch
+
+SSH clone/push: `git clone ssh://git@git.dgrzone.com:2222/<user>/<repo>.git`
+
+---
+
+## 4. Knowledge Layer (AE Journals)
+
+**Status:** Tools exist, import script not yet built.
+
+AE Journals becomes the searchable long-term knowledge base. Complements memory distillation: memory files cover "what have we been working on lately"; Journals cover "what do I know about topic X".
+
+**Existing tools:** `ae_journal_search`, `ae_journal_entry_create` — already in orchestrator tool suite.
+
+**Import script (to build):**
+- Walk a markdown directory (Nextcloud, agents_sync docs)
+- Chunk by H2 section
+- Search before creating (deduplication)
+- Tag from frontmatter, filename, directory path
+- Target sources: `~/DgrZone_Nextcloud/`, `~/OSIT_Nextcloud/`
+
+**Agent workflow:**
+```
+"Summarize my notes on WireGuard setup"
+    → orchestrator calls ae_journal_search("wireguard")
+    → returns matching entries
+    → Claude synthesizes response
+```
+
+---
+
+## 5. Intelligent Model Routing
+
+**Status:** Deferred. Currently user-toggled.
+
+Route automatically based on task characteristics rather than requiring manual backend selection:
+
+| Task type | Backend | Reason |
+|---|---|---|
+| User-facing conversation | Claude | Quality prose, persona fidelity |
+| Tool use / orchestration | Gemini API | Native function calling, free tier |
+| Private / sensitive / offline | Local (Ollama) | No data leaves the network |
+| Long context (>50k tokens) | Gemini 2.0 | 1M token context window |
+| Fast/cheap simple queries | Local (E4B) | 25 t/s, no API cost |
+
+Routing logic would live in `llm_client.py` or a new `router.py` — map task metadata to backend choice.
+
+---
+
+## 6. RAG via Open WebUI
+
+**Status:** Future — Open WebUI already supports it.
+
+Feed Nextcloud documents or session logs into Open WebUI knowledge collections. Reference them in local model chat via `"files": [{"type": "collection", "id": "..."}]`.
+
+Would complement AE Journals for local-only contexts where data shouldn't leave the network.
+
+API reference: [`docs/OPEN_WEBUI_API.md`](../docs/OPEN_WEBUI_API.md) — RAG section.
+
+---
+
+## 8. Agent Architecture Ideas (from Claude Code leak)
+
+**Status:** Research — review before building dev agent pipeline and orchestrator.
+
+The Claude Code system prompt was leaked in early April 2026. Two reimplementation repos are worth reading for design ideas before building out the dev agent pipeline and local orchestrator:
+
+- https://github.com/HarnessLab/claw-code-agent — Python reimplementation targeting local models (Qwen3-Coder recommended); most technically detailed
+- https://github.com/ultraworkers/claw-code — Community porting/reverse-engineering project; reportedly has interesting detail in the source code itself
+
+**Ideas worth incorporating:**
+
+**Tiered permission architecture** — explicit read-only / write / shell / unsafe modes, each requiring an opt-in flag. Currently Cortex has implicit trust for agent operations. Relevant once the dev agent pipeline is writing and executing code — don't want a `brief` cron job accidentally in write mode.
+
+**Agent lineage tracking** — agent manager records which agent spawned which sub-agent. Useful for debugging multi-step orchestrated tasks and essential for the supervisor → specialist → approval gate chain.
+
+**Cost/budget enforcement** — hard token and cost budgets per operation, multiple budget types. `ORCHESTRATOR_MAX_ROUNDS=10` is Cortex's only guardrail today. Worth adding a token budget check to the tool loop, especially relevant for local models with hard context ceilings (72k/50k practical).
+
+**Context compaction/snipping** — automatic mid-session context trimming when approaching limits. Important for long orchestrator runs against local models. Could trim tool results that are no longer needed for the current reasoning step.
+
+**Nested agent delegation with dependency-aware batching** — sub-agents that know their parent; parallel sub-tasks batched by dependency order. Directly applicable to the dev agent pipeline (orchestrator → specialist → supervisor, with some steps parallelizable).
+
+**File history journaling** — beyond session logs, a journal of what files changed and why, with replay summaries. Different from memory distillation — more like a git log for agent actions. Could complement the supervisor agent's diff review.
+
+**Plugin/manifest-based tool extensions** — tools declared via manifest rather than hardcoded in `__init__.py`. Would make adding new orchestrator tools less invasive. Worth considering before the tool suite grows much larger.
+
+---
+
+## 7. Permanent Fleet Hosting
+
+**Status:** Deferred.
+
+Currently running on `scott_lpt` (main laptop). Long-term target: home server (always-on, Docker).
+
+`docker-compose.yml` already exists in the project root. Deployment path:
+1. Copy to home server
+2. Configure reverse proxy (Nginx, already Docker-hosted)
+3. Set subdomain `cortex.dgrzone.com` → home server internal IP
+4. WireGuard required for all access — not internet-exposed
--- a/documentation/ARCH__Intelligence_Layer.md
+++ b/documentation/ARCH__Intelligence_Layer.md
@@ -1,306 +1,14 @@
-# Architecture: Intelligence Layer
+# ARCH__Intelligence_Layer.md — Archived

-**Status:** Design phase — not yet implemented
-**Last updated:** 2026-03-18
+This document has been split into focused per-topic docs.

-This document captures the architectural thinking behind expanding Cortex from a smart dispatcher into a genuine intelligence layer: capable of using tools, coordinating specialist agents, and managing a personal knowledge base.
+| What you're looking for | New location |
+|---|---|
+| Overall architecture, design decisions | [`ARCH__SYSTEM.md`](ARCH__SYSTEM.md) |
+| Orchestrator/Responder pattern, tool loop | [`ARCH__FUTURE.md`](ARCH__FUTURE.md) — section 1 |
+| Dev agent pipeline, supervisor agent | [`ARCH__FUTURE.md`](ARCH__FUTURE.md) — section 2 |
+| Knowledge layer, AE Journals import | [`ARCH__FUTURE.md`](ARCH__FUTURE.md) — section 4 |
+| LLM backends and routing | [`ARCH__BACKENDS.md`](ARCH__BACKENDS.md) |
+| Model routing (future) | [`ARCH__FUTURE.md`](ARCH__FUTURE.md) — section 5 |

---
-
-## Overview
-
-Cortex currently dispatches chat messages to LLM CLI backends and returns the response. The Intelligence Layer adds three major capabilities on top of that foundation:
-
-1. **Orchestrator/Responder** — Gemini handles tool use and planning; Claude handles the user-facing response
-2. **Dev Agent Pipeline** — Specialist agents implement code changes; a supervisor checks the work
-3. **Knowledge Layer** — AE Journals becomes the primary knowledge base; agents can read and write it
-
-These are independent tracks that share the same trigger layer and can be built incrementally.
-
---
-
-## 1. Orchestrator / Responder Pattern
-
-### The Problem
-
-Claude CLI (via Pro subscription) doesn't expose direct API tool-calling. Gemini API (free tier) does. But Claude produces higher-quality user-facing prose and reasoning. The solution is to use each model for what it does best.
-
-### The Pattern
-
-```
-User message
-    ↓
-Orchestrator (Gemini API)
-    • interprets intent
-    • decides which tools to call
-    • executes tool loop (ReAct: reason → act → observe → repeat)
-    • assembles enriched context + tool results
-    ↓
-Responder (Claude CLI)
-    • receives enriched context
-    • writes the user-facing response
-    ↓
-User
-```
-
-For **direct chat** (no tools needed), the orchestrator is bypassed entirely — message goes straight to Claude. The orchestrator only activates when tools are required or when explicitly invoked (e.g., a background task).
-
-### Why Gemini API (not CLI)?
-
- Gemini CLI is a subprocess; function calling via subprocess is fragile
- Gemini API (`google-generativeai` SDK) has native structured tool-calling
- Free tier (Gemini 2.0 Flash) handles orchestration load without cost
- Access token is short-lived but auto-refreshed by the SDK (no expiry problem)
-
-### Tool Strategy
-
-Tools for the orchestrator are **separate** from the existing `ae_*` MCP tools. The ae_* tools are stable and used by existing agents — do not modify them.
-
-New orchestrator tools are Python functions wrapped in Gemini function declarations:
-
-| Tool | What it does | Implementation |
-|---|---|---|
-| `web_search` | DuckDuckGo search | `duckduckgo-search` library |
-| `ae_journal_search` | Search AE Journals via V3 API | HTTP to AE API |
-| `ae_journal_entry_create` | Write a new journal entry | HTTP to AE API |
-| `ae_task_list` | Read Kanban tasks | HTTP to AE API or agents_sync file |
-| `file_read` | Read a file from known safe paths | Python `pathlib` |
-| `gitea_api` | Query Gitea repos, issues, PRs | Gitea REST API |
-
-Tools are registered in `cortex/tools/` (one file per domain group).
-
-### Implementation Path
-
-```
-cortex/
-  tools/
-    __init__.py          — tool registry
-    web.py               — web_search
-    ae_knowledge.py      — ae_journal_* tools
-    ae_tasks.py          — task tools
-    gitea.py             — Gitea API tools
-  routers/
-    orchestrator.py      — POST /orchestrate, GET /orchestrate/{job_id}
-  orchestrator_engine.py — Gemini tool loop + Claude handoff
-```
-
-Endpoint contract:
-
-```
-POST /orchestrate
-{
-  "task": "What tasks are due this week and summarize my notes on X topic",
-  "session_id": "optional — if part of an ongoing conversation",
-  "respond_with_claude": true   // false = return Gemini's assembled context only
-}
-
-→ { "job_id": "uuid", "status": "queued" }
-
-GET /orchestrate/{job_id}
-→ { "status": "complete", "result": "...", "tool_calls": [...] }
-```
-
---
-
-## 2. Trigger Layer
-
-All three capabilities (chat, orchestration, dev agents) share the same trigger layer:
-
-```
-┌────────────────────────────────────────────────┐
-│  TRIGGERS                                      │
-│                                                │
-│  Chat UI  →  POST /chat  (existing)            │
-│  Cron     →  POST /orchestrate  (new)          │
-│  Gitea    →  POST /webhook/gitea  (new)        │
-│  NC Talk  →  POST /webhook/nextcloud  (exists) │
-│  Manual   →  CLI / curl for debugging          │
-└────────────────────────────────────────────────┘
-```
-
-Cron trigger example (from existing cron infrastructure):
-
-```bash
-curl -X POST http://localhost:8000/orchestrate \
-  -H "Content-Type: application/json" \
-  -d '{"task": "Check for overdue Kanban tasks and notify via NC Talk"}'
-```
-
-This means the same orchestrator endpoint is usable from chat, crons, and webhooks without any special cases.
-
---
-
-## 3. Dev Agent Pipeline
-
-### The Goal
-
-Accept a plain-English task like *"Fix the bug where X, add a test for it"* and produce:
- A working code change
- Passing syntax/type checks
- A summary of what changed and what still needs human review
- A commit ready to push (pending approval)
-
-### Architecture
-
-```
-Task request (chat / Gitea issue / Kanban)
-    ↓
-Orchestrator
-    • reads relevant files (context gathering)
-    • routes to correct specialist
-    ↓
-Specialist Agent (Claude CLI in project directory)
-    • implements the change
-    • runs self-check: py_compile / svelte-check
-    ↓
-Supervisor Agent
-    • reviews the diff
-    • runs test suite
-    • returns: PASS / NEEDS_REVIEW / FAIL + reason
-    ↓
-Human approval gate
-    • summary shown in Cortex UI or NC Talk
-    • user approves → commit + optional push
-    • user rejects → feedback goes back to specialist
-```
-
-### Specialist Agents
-
-Two initial specialists, both using Claude CLI:
-
-**Frontend specialist** (working dir: `~/OSIT_dev/aether_app_sveltekit/`):
- Reads `documentation/TODO__Agents.md` and `CLAUDE.md` before acting
- Runs `npx svelte-check` after every change — no exceptions
- Atomic commits (one component or fix per commit)
-
-**Backend specialist** (working dir: `~/OSIT_dev/aether_api_fastapi/`):
- Reads `documentation/TODO__Agents.md` and `CLAUDE.md` before acting
- Runs `python3 -m py_compile` after every file edit
- Runs unit tests before declaring done
- Flags E2E tests that need human review
-
-### Supervisor Agent
-
-The supervisor is a separate Claude invocation that receives:
- The diff of all changed files
- Stdout/stderr from all checks that were run
- The original task description
-
-It returns a structured assessment:
-
-```json
-{
-  "verdict": "PASS | NEEDS_REVIEW | FAIL",
-  "checks_passed": ["py_compile", "unit_tests"],
-  "checks_failed": [],
-  "review_notes": "E2E tests not run — touch auth router, recommend manual check",
-  "commit_message": "fix: correct session token validation in auth middleware"
-}
-```
-
-### Gitea Integration
-
- **Gitea webhooks → Cortex:** Push/PR events trigger supervisor review automatically
- **Gitea Actions:** Run `py_compile`/`svelte-check` on every push (simple CI, no custom runner)
- **Cortex → Gitea:** After human approval, supervisor calls Gitea API to create PR or push
-
-Gitea Actions are simpler than they sound — a `.gitea/workflows/check.yml` is just a YAML file that runs shell commands on push. No external CI infrastructure needed.
-
---
-
-## 4. Knowledge Layer
-
-### The Goal
-
-AE Journals becomes the primary source of truth for personal and business knowledge. Notes, documentation, and logs that currently live scattered across markdown files get organized into Journals with proper structure, search, and agent-accessible read/write.
-
-### Import Strategy
-
-1. **Don't bulk-import blindly.** The orchestrator searches AE Journals before creating anything (deduplication).
-2. **Chunk by section.** A large markdown file becomes multiple journal entries — one per H2 section.
-3. **Preserve provenance.** Each imported entry includes source path, import date, and original file date in its `data_json` or notes.
-4. **Tag intelligently.** Tags come from: frontmatter, filename keywords, directory path, and content analysis.
-
-### Source Priority
-
-| Source | Priority | Notes |
-|---|---|---|
-| `~/DgrZone_Nextcloud/` | High | Personal notes, projects |
-| `~/OSIT_Nextcloud/` | High | Business docs |
-| `~/agents_sync/aether/docs/` | Medium | Platform specs (already structured) |
-| OpenClaw session logs | Low | Historical, lots of noise |
-
-### Agent Workflow
-
-```
-"Summarize my notes on WireGuard setup"
-    ↓
-Orchestrator calls ae_journal_search("wireguard")
-    ↓
-Returns matching entries
-    ↓
-Claude synthesizes a response
-```
-
-```
-"Save this as a note in my DgrZone journal"
-    ↓
-Orchestrator calls ae_journal_entry_create(
-    journal="DgrZone General",
-    title="...",
-    content="...",
-    tags=["note", "wireguard"]
-)
-```
-
-### Context Tiers (Inara Memory)
-
-The existing distill system (`MEMORY_SHORT.md`, `MEMORY_MID.md`, `MEMORY_LONG.md`) handles working memory. The Knowledge Layer is complementary — it's the **searchable long-term archive**, not the rolling context window. Agents should:
-
- Use memory files for "what have we been working on lately"
- Use AE Journals search for "what do I know about topic X"
-
---
-
-## 5. Model Routing (Future)
-
-Currently hardcoded: Claude default, Gemini fallback. Future intelligent routing:
-
-| Task type | Model | Reason |
-|---|---|---|
-| User-facing conversation | Claude | Quality prose, reasoning |
-| Tool use / orchestration | Gemini API | Native function calling, free |
-| Private / sensitive | Ollama (local) | No data leaves the network |
-| Long context (>100k tokens) | Gemini 2.0 | 1M token context window |
-| Code generation | Claude | Strong code quality |
-
-Routing logic lives in `cortex/orchestrator_engine.py` — a simple function that maps task metadata to a backend choice.
-
---
-
-## Implementation Order (Recommended)
-
-1. **Orchestrator Phase 1** — Gemini API integration, basic tool loop, `/orchestrate` endpoint
-   - Unlocks: web search in chat, AE Journal queries, cron-triggered tasks
-2. **Knowledge import** — markdown → AE Journal Entries tool + import script
-   - Unlocks: searchable knowledge base for all agents
-3. **Dev agent pipeline** — Frontend + Backend specialist agents
-   - Unlocks: AI-assisted development with supervisor review
-4. **Gitea integration** — webhook receiver + Actions CI
-   - Unlocks: event-driven automation, PR workflow
-5. **Intelligent routing** — model selection by task type
-   - Polish: cost and quality optimization
-
---
-
-## Key Design Decisions
-
-| Decision | Choice | Rationale |
-|---|---|---|
-| Orchestrator model | Gemini API (not CLI) | Native tool calling; free tier |
-| Responder model | Claude CLI (Pro sub) | Quality output; no API cost |
-| Direct chat bypass | Yes | Don't add latency when tools aren't needed |
-| Tool set | Separate from ae_* MCPs | ae_* tools are stable; don't risk breaking active agents |
-| Dev agents | Claude CLI in project dir | CLAUDE.md + project context already in place |
-| Human approval gate | Required before commit | Agents can propose; humans decide |
-| Knowledge primary source | AE Journals | Already exists, structured, searchable |
+*Original content written 2026-03-18. Superseded 2026-04-03.*
--- a/documentation/ARCH__PERSONA.md
+++ b/documentation/ARCH__PERSONA.md
@@ -0,0 +1,121 @@
+# Architecture: Persona System & Memory
+
+> How Inara (and other personas) know who they are and what they remember.
+> Last updated: 2026-04-03
+
+---
+
+## Filesystem Layout
+
+Each persona lives in `home/{username}/persona/{name}/`:
+
+```
+home/scott/persona/inara/
+  IDENTITY.md       Who Inara is — role, name, origin
+  SOUL.md           Values, personality, voice, what she cares about
+  PROTOCOLS.md      Behavioral rules — how she responds, what she avoids
+  CONTEXT_TIERS.md  Documents which files load at each tier
+  USER.md           Scott's profile — loaded into context so she knows who she's talking to
+  HELP.md           Persona-specific help content (appended to shared HELP.md in UI)
+  MEMORY_SHORT.md   Recent session digest (auto-distilled daily)
+  MEMORY_MID.md     Mid-term summary (auto-distilled weekly)
+  MEMORY_LONG.md    Long-term memory (auto-distilled monthly)
+  REMINDERS.md      Pending reminders (auto-surfaced at tier 2+)
+  SCRATCH.md        Ephemeral scratchpad (read/write via tools)
+  TASKS.json        Personal task list (managed via tools)
+  CRONS.json        Scheduled jobs (managed via tools)
+  sessions/         Session turn logs — YYYY-MM-DD.md, one file per day
+```
+
+**ContextVars:** `persona.py` sets `_user` and `_persona` ContextVars per request. Everything downstream calls `persona_path()` to resolve the right directory — no globals, no thread-local state.
+
+---
+
+## Context Tiers
+
+Each chat request specifies a tier (default: 2). Higher tiers load more context — slower but richer.
+
+| Tier | Loaded Files | Use case |
+|---|---|---|
+| 1 | IDENTITY.md | Minimal — lightweight tasks |
+| 2 | + SOUL.md, PROTOCOLS.md, USER.md, MEMORY_SHORT.md, MEMORY_MID.md, REMINDERS.md | Standard chat |
+| 3 | + MEMORY_LONG.md, CONTEXT_TIERS.md | Deep sessions, long tasks |
+| 4 | + SCRATCH.md, TASKS.json | Full state — agent mode |
+
+`context_loader.py` assembles the system prompt from these files in order. The resulting prompt is passed to whichever LLM backend handles the request.
+
+---
+
+## Memory Distillation
+
+Three-tier rolling memory system, run by APScheduler:
+
+```
+sessions/YYYY-MM-DD.md  ← raw session logs (written by session_logger.py)
+        ↓ daily 03:00
+MEMORY_SHORT.md         ← recent session digest (no LLM — pure aggregation)
+        ↓ weekly Sun 03:30
+MEMORY_MID.md           ← concise summary (LLM)
+        ↓ monthly 1st 04:00
+MEMORY_LONG.md          ← integrated long-term memory (LLM)
+```
+
+**Short distill** — reads the most recent session files that fit within the token budget, writes them in chronological order. No LLM involved — fast and cheap.
+
+**Mid distill** — LLM summarizes MEMORY_SHORT into a concise digest. Prompt asks for recurring themes, decisions, ongoing projects, Scott's current state and priorities. Written in first person as Inara.
+
+**Long distill** — LLM integrates MEMORY_MID into MEMORY_LONG. Rules: preserve historical facts, update stale info, absorb new themes, remove irrelevant entries.
+
+**Distill notifications** — after mid and long runs, `notification.py` sends a message to the user's configured NC Talk notification room (if `notification_room` is set in `channels.json`).
+
+**Controls** in `.env`:
+```
+AUTO_DISTILL=true
+AUTO_DISTILL_SHORT=true
+AUTO_DISTILL_MID=true
+AUTO_DISTILL_LONG=true          # off by default — first run warrants manual review
+DISTILL_BACKEND_MID=local       # use local model to save API credits
+DISTILL_BACKEND_LONG=           # empty = primary backend (claude recommended)
+MEMORY_BUDGET_SHORT=3000        # token budgets (soft caps)
+MEMORY_BUDGET_MID=2000
+MEMORY_BUDGET_LONG=2000
+```
+
+Manual distill via API:
+```
+POST /distill/short
+POST /distill/mid
+POST /distill/long
+GET  /distill/status
+```
+
+---
+
+## Adding a New Persona
+
+`persona_template.py` bootstraps a new persona directory from string templates. The onboarding flow (`/setup/persona`) calls this when a new user creates their first persona.
+
+To add one manually:
+1. Create `home/{username}/persona/{name}/`
+2. Copy and edit the files from an existing persona (e.g. `home/scott/persona/inara/`)
+3. At minimum: `IDENTITY.md`, `SOUL.md`, `PROTOCOLS.md`, `USER.md`
+4. The distiller will create the `MEMORY_*.md` files on first run
+
+---
+
+## Session Search
+
+Past sessions are searchable via `GET /sessions/search?q=...&user=...&persona=...`.
+
+Available in the UI via the search box at the bottom of the Files panel (open with the Files button). Results are grouped by date with highlighted excerpts.
+
+---
+
+## Active Personas
+
+| User | Persona | Description |
+|---|---|---|
+| scott | inara | Scott's primary assistant |
+| scott | developer | Dev-focused persona |
+| holly | tina | Holly's primary assistant |
+| brian | wintermute | Brian's primary assistant |
--- a/documentation/ARCH__SYSTEM.md
+++ b/documentation/ARCH__SYSTEM.md
@@ -0,0 +1,90 @@
+# Architecture: System Overview
+
+> How the pieces fit together.
+> Last updated: 2026-04-03
+
+---
+
+## Architecture Diagram
+
+```
+┌─────────────────────────────────────────────────────────┐
+│  INPUT CHANNELS                                         │
+│                                                         │
+│  Web UI ──────────────────────────────────────────┐     │
+│  Nextcloud Talk ──── POST /webhook/nextcloud/{u} ─┤     │
+│  Google Chat ─────── POST /channels/google-chat/{u}┤    │
+│  Cron / Scheduler ─────────────────────────────────┤    │
+│  Webhooks (future) ─────────────────────────────────┘   │
+└─────────────────────────────┬───────────────────────────┘
+                              ↓
+┌─────────────────────────────────────────────────────────┐
+│  CORTEX DISPATCHER  (FastAPI — cortex/)                 │
+│                                                         │
+│  auth_middleware.py  → validates JWT session cookie     │
+│  persona.py          → resolves user + persona context  │
+│  context_loader.py   → assembles system prompt (tier 1-4)│
+│                                                         │
+│  POST /chat          → direct LLM, streaming SSE        │
+│  POST /orchestrate   → Gemini tool loop → Claude        │
+│  GET  /orchestrate/{id} → poll job result               │
+└────────────┬───────────────────┬────────────────────────┘
+             ↓                   ↓
+┌─────────────────┐   ┌──────────────────────────────────┐
+│  LLM BACKENDS   │   │  PERSONA DATA                    │
+│                 │   │  home/{user}/persona/{name}/      │
+│  Claude CLI     │   │                                  │
+│  Gemini CLI     │   │  IDENTITY.md  SOUL.md            │
+│  Gemini API     │   │  PROTOCOLS.md MEMORY_*.md        │
+│  Local (httpx)  │   │  USER.md  REMINDERS.md           │
+│                 │   │  TASKS.json  CRONS.json          │
+└─────────────────┘   │  sessions/  SCRATCH.md          │
+                      └──────────────────────────────────┘
+```
+
+Details: [`ARCH__BACKENDS.md`](ARCH__BACKENDS.md) | [`ARCH__PERSONA.md`](ARCH__PERSONA.md) | [`ARCH__CHANNELS.md`](ARCH__CHANNELS.md)
+
+---
+
+## Service Layout (`cortex/`)
+
+| File | Purpose |
+|---|---|
+| `main.py` | App entry point, router registration |
+| `config.py` | All settings (pydantic-settings, reads `.env`) |
+| `persona.py` | User + persona path resolution, ContextVars |
+| `context_loader.py` | Builds system prompt from persona files (tiers 1–4) |
+| `llm_client.py` | All LLM backends — Claude, Gemini CLI, Local |
+| `orchestrator_engine.py` | Gemini API ReAct tool loop → Claude handoff |
+| `session_store.py` | In-memory + file session persistence |
+| `session_logger.py` | Writes session turns to `sessions/YYYY-MM-DD.md` |
+| `memory_distiller.py` | Short/mid/long distill jobs |
+| `scheduler.py` | APScheduler — distill jobs + user crons |
+| `cron_runner.py` | Cron job storage, schedule parsing, execution |
+| `notification.py` | Outbound channel messages (distill alerts, cron proactive) |
+| `auth_utils.py` | bcrypt passwords, JWT, invite tokens, channel config |
+| `auth_middleware.py` | JWT cookie validation on all routes |
+| `user_settings.py` | Per-user local LLM config (hosts, models, active model) |
+| `event_bus.py` | Internal SSE pub/sub (NC Talk → browser mirror) |
+| `email_utils.py` | SMTP invite emails |
+| `persona_template.py` | Bootstrap a new persona directory from templates |
+| `routers/` | One file per endpoint group (chat, orchestrator, auth, files, channels, ui, settings…) |
+| `tools/` | Orchestrator tool implementations (web, ae_knowledge, tasks, scratch, reminders, cron, system) |
+| `static/` | Web UI — `index.html`, `app.js`, `style.css`, `login.html`, `setup.html`, `HELP.md` |
+| `tests/` | pytest suite (80 tests) |
+
+---
+
+## Key Design Decisions
+
+**Two-brain pattern** — Gemini API handles tool use (function calling, planning, web search). Claude CLI handles all user-facing responses. Direct chat bypasses the orchestrator entirely.
+
+**Subprocess backends** — Claude and Gemini run as CLI subprocesses (`claude --print`, `gemini -p`). This keeps auth transparent (Claude Code manages tokens) and avoids API costs on the Pro subscription path.
+
+**Local backend via httpx** — Open WebUI's OpenAI-compatible API (`/api/chat/completions`). No CLI wrapper. Per-user host + model config in `local_llm.json`.
+
+**ContextVars for async isolation** — `persona.py` uses Python `contextvars.ContextVar` so concurrent requests each see their own user/persona without thread-local hacks.
+
+**Per-user filesystem layout** — `home/{user}/persona/{name}/` mirrors Linux home directories. Each persona is a directory of markdown files and JSON. No database. Easy to inspect, edit, and back up.
+
+**No single point of coupling** — tools live in `cortex/tools/`, separate from `ae_*` MCP tools. Channels live in `cortex/routers/`, each self-contained. Adding a channel or tool doesn't touch other subsystems.
--- a/documentation/MASTER.md
+++ b/documentation/MASTER.md
@@ -0,0 +1,92 @@
+# Cortex / Inara — Master Index
+
+> Start here. This document is a map, not a manual.
+> Last updated: 2026-04-03
+
+---
+
+## What It Is
+
+Cortex is a self-hosted personal AI platform. It routes messages from any input channel to AI backends, manages a resident agent (Inara) with persistent memory, and coordinates across a fleet of machines. It is infrastructure, not a product.
+
+**Running at:** `https://cortex.dgrzone.com` | `systemctl --user restart cortex`
+
+---
+
+## Current State
+
+| Component | Status | Notes |
+|---|---|---|
+| Web UI | ✅ Live | SPA, dark theme, mobile-responsive, session auth |
+| Nextcloud Talk bot | ✅ Live | HMAC-signed, per-user routing |
+| Google Chat Add-on | ✅ Live | JWT-verified, per-user routing |
+| Claude backend | ✅ Live | Primary — via Claude Code CLI |
+| Gemini backend | ✅ Live | Fallback — via Gemini CLI |
+| Local backend | ✅ Live | Third option — Open WebUI/Ollama on scott_gaming |
+| Gemini orchestrator | ✅ Live | Tool loop → Claude response, Agent mode in UI |
+| Memory distillation | ✅ Live | Short (daily) / Mid (weekly) / Long (monthly) |
+| Multi-user | ✅ Live | Scott, Holly, Brian — each with own personas |
+| Session search | ✅ Live | Full-text search across past session logs |
+| Proactive cron | ✅ Live | `message` and `brief` job types → NC Talk |
+
+**Active users / personas:** scott/inara, scott/developer, holly/tina, brian/wintermute
+
+---
+
+## Document Map
+
+### Project-Level
+| Doc | What it covers |
+|---|---|
+| **This file** | Index and current state |
+| [`CORTEX.md`](../CORTEX.md) | Vision, philosophy, "what it is and isn't" |
+| [`ROADMAP.md`](ROADMAP.md) | Phases — what's done, what's next, what's deferred |
+| [`TODO__Agents.md`](TODO__Agents.md) | Active task list — read before starting work |
+
+### Architecture
+| Doc | What it covers |
+|---|---|
+| [`ARCH__SYSTEM.md`](ARCH__SYSTEM.md) | Overall architecture, component map, key design decisions |
+| [`ARCH__BACKENDS.md`](ARCH__BACKENDS.md) | LLM backends, routing, fallback, per-user config |
+| [`ARCH__PERSONA.md`](ARCH__PERSONA.md) | Persona system, context tiers, memory distillation |
+| [`ARCH__CHANNELS.md`](ARCH__CHANNELS.md) | Input channels — web, NC Talk, Google Chat, cron |
+| [`ARCH__FUTURE.md`](ARCH__FUTURE.md) | Planned: local orchestrator, dev agents, knowledge layer |
+
+### Setup & Reference
+| Doc | What it covers |
+|---|---|
+| [`docs/NEXTCLOUD_TALK_BOT.md`](../docs/NEXTCLOUD_TALK_BOT.md) | NC Talk bot setup and troubleshooting |
+| [`docs/GOOGLE_CHAT_BOT.md`](../docs/GOOGLE_CHAT_BOT.md) | Google Chat Add-on setup |
+| [`docs/OPEN_WEBUI_API.md`](../docs/OPEN_WEBUI_API.md) | Open WebUI/Ollama API reference for local model work |
+
+### Code-Level
+| Doc | What it covers |
+|---|---|
+| [`CLAUDE.md`](../CLAUDE.md) | Project instructions for Claude Code — directory map, run commands, design decisions |
+| [`README.md`](../README.md) | Project root orientation, quick-start, user management |
+| [`cortex/static/HELP.md`](../cortex/static/HELP.md) | In-app help (rendered in UI for all users) |
+
+---
+
+## Quick Reference
+
+**Start the service / check logs**
+```bash
+systemctl --user restart cortex
+journalctl --user -u cortex -f
+```
+
+**Syntax check before restart**
+```bash
+python3 -m py_compile cortex/<file>.py
+```
+
+**Add a user**
+```bash
+cd cortex && .venv/bin/python manage_passwords.py invite <username> <email>
+```
+
+**Run tests**
+```bash
+cd cortex && .venv/bin/python -m pytest tests/ -q
+```
--- a/documentation/ROADMAP.md
+++ b/documentation/ROADMAP.md
@@ -0,0 +1,71 @@
+# Cortex — Roadmap
+
+> Phases and priorities. For active tasks see `TODO__Agents.md`.
+> Last updated: 2026-04-03
+
+---
+
+## Phase 0 — Foundation ✅
+- Syncthing fleet sync (`agents_sync/`) operational
+- MCP tools (`ae_*`) available in all Claude Code sessions
+- Fleet agents running independently on each machine
+
+## Phase 1 — Dispatcher Core ✅
+- FastAPI service with streaming SSE responses
+- Claude CLI and Gemini CLI subprocess backends
+- Session context management (rolling window, file persistence)
+- Nextcloud Talk bot (HMAC-signed webhook)
+- Memory distiller (APScheduler — short/mid/long cycles)
+- Local web UI (single-page, mobile-responsive)
+- Auth status monitoring (`/auth/status`, UI banner)
+- Session logging and file browser
+
+## Phase 2 — Identity & Multi-User ✅
+- Inara persona formalized (`IDENTITY.md`, `SOUL.md`, `PROTOCOLS.md`, context tiers)
+- Two-level user/persona layout (`home/{user}/persona/{name}/`)
+- Session auth: bcrypt passwords, JWT cookies, invite tokens, Google OAuth
+- Multi-user live: Scott, Holly, Brian
+- Per-user channel config (`channels.json`)
+- Per-user Gemini API key (settings UI)
+- Help & Reference system (shared base + per-persona additions)
+- Lucide icons, persona picker page, session persistence across navigation
+
+## Phase 3 — Intelligence Layer (In Progress)
+- ✅ Gemini API orchestrator (tool loop → Claude responder)
+- ✅ Tool suite: web search, AE Journal read/write, tasks, scratch, reminders, cron, system
+- ✅ Agent mode in UI (async job, poll for result)
+- ✅ Local LLM backend (Open WebUI/Ollama, per-user multi-model config)
+- ✅ Proactive cron (`message` / `brief` job types → NC Talk)
+- ✅ Session search (full-text across past session logs)
+- ✅ Distill notifications (NC Talk after mid/long runs)
+- ✅ Local backend for distillation (DISTILL_BACKEND_MID/LONG in .env)
+- [ ] **Local orchestrator** — ReAct tool loop using local model (High priority — see `TODO__Agents.md`)
+- [ ] Knowledge import — markdown → AE Journals (import script)
+- [ ] Dev agent pipeline — specialist agents + supervisor + approval gate
+- [ ] Gitea webhook integration + Actions CI
+
+## Phase 4 — Channel Expansion
+- ✅ Web UI
+- ✅ Nextcloud Talk
+- ✅ Google Chat
+- [ ] WhatsApp (Business API or bridge — investigating)
+- [ ] Webhook triggers from Aether platform events
+
+## Phase 5 — Routing Intelligence & Scale
+- [ ] Intelligent model routing (by task type, privacy, context length)
+- [ ] Agent-to-agent task delegation across fleet
+- [ ] Permanent hosting on home server (currently on `scott_lpt`)
+
+## Phase 6 — Infrastructure
+- [ ] Server DMZ finalized
+- [ ] WireGuard for all Cortex-accessing devices
+- [ ] Camera/IoT VLAN segmentation
+
+---
+
+## Deferred / Watching
+- **Unsloth Gemma 4 GGUFs** — blocked on Ollama v0.20.1 (llama.cpp GGUF metadata issue); switch `agent-support-gemma-*` aliases to Unsloth Q4_K_M when ready
+- **Speculative decoding** — llama.cpp supports it (E4B + E2B draft ≈ 2x speed); Ollama does not yet
+- **RAG via Open WebUI** — feed Nextcloud docs into local knowledge collections; possible complement to AE Journals search
+- **Multi-host local models** — per-user config already supports multiple hosts; routing logic TBD
+- **WhatsApp** — requires Business API account or a bridge; not started
--- a/documentation/TODO__Agents.md
+++ b/documentation/TODO__Agents.md
@@ -7,16 +7,21 @@

 ## 🔴 High Priority

-### [Backend] Ollama local model backend
- Add Ollama as a third LLM backend option (direct Ollama API, no CLI wrapper)
- Endpoint: `http://scott-gaming:<port>/api/` (WireGuard)
- Model selection: configurable per-request or per-session
- Auth status check: ping `/api/tags` to confirm reachability
+### [Local] Tool-capable local orchestrator
+Design and implement `local_orchestrator_engine.py` — a ReAct tool loop driven by
+a local model via Open WebUI's OpenAI-compatible API, as an alternative to the
+Gemini API orchestrator for private/offline tasks.

-### [Testing] Gitea SSH port 2222 ✅ — 2026-03-29
- pfSense WAN → 192.168.32.7:2222 port forward confirmed working
- `ssh -p 2222 git@git.dgrzone.com` reaches Gitea (returns "Invalid repository path" — expected, confirms connectivity)
- Clone/push via SSH: `git clone ssh://git@git.dgrzone.com:2222/<user>/<repo>.git`
+- [ ] Convert existing Cortex tool definitions (`cortex/tools/`) from Gemini
+      `FunctionDeclaration` format to OpenAI `tools` format (minor schema diff)
+- [ ] Implement tool loop: send tools → parse `tool_calls` response → execute →
+      append result → loop until `finish_reason: stop`
+- [ ] Wire into `routers/orchestrator.py` — new `mode` param: `"local"` vs `"gemini"`
+- [ ] UI: Agent mode button routes to local orchestrator when local backend active
+- [ ] Recommended models (scott_gaming, 8 GB VRAM):
+      Gemma 4 E4B — 25 t/s, 72k practical ctx — interactive/fast tasks
+      Gemma 4 26B A4B — 9 t/s, 50k practical ctx — heavier reasoning, background tasks
+- Reference: `docs/OPEN_WEBUI_API.md` for full tool call request/response format

 ---

@@ -30,15 +35,22 @@ See `ARCH__Intelligence_Layer.md` for full design.
 - [ ] Target: markdown files from `~/DgrZone_Nextcloud/` and `~/OSIT_Nextcloud/`
 - [ ] Tag strategy: source path, date, topic tags from frontmatter or filename

-### [Distill] Monitor first auto_distill_long run
- Scheduled for ~April 1 at 04:00
- Manually review `inara/MEMORY_LONG.md` output before fully trusting
- Adjust distill prompts if needed
+### [Distill] Review first auto_distill_long output — 2026-04-01
+- Ran April 1 at 04:00 as scheduled
+- Manually review `inara/MEMORY_LONG.md` — confirm quality before fully trusting
+- Adjust distill prompts in `cortex/memory_distiller.py` if needed

 ### [Distill] Distill quality review
 - Short/mid/long distill prompts live in `cortex/memory_distiller.py`
 - After first few automatic runs, review quality and tune

+### [Local] Unsloth Gemma 4 variants
+- Unsloth Dynamic 2.0 Q4_K_M GGUFs fail with `500: unable to load model` on Ollama v0.20.0
+- Root cause: Ollama's bundled llama.cpp doesn't recognize Gemma 4 GGUF architecture metadata from raw files
+- Waiting on Ollama point release (v0.20.1+) — then switch Open WebUI to Unsloth variants
+- Expected speedup: ~10–20% smaller context footprint vs baseline, same quality
+- `agent-support-gemma-small` → Unsloth E4B Q4_K_M; `agent-support-gemma-medium` → Unsloth 26B A4B Q4_K_M
+
 ---

 ## 🟢 Lower Priority / Future
@@ -61,15 +73,49 @@ See `ARCH__Intelligence_Layer.md`. Full design not yet started.
 - `cortex/routers/` already has pattern; add `gitea.py`
 - Gitea Actions (CI) for "run tests on push" — simpler than custom runner

+### [Local] RAG via Open WebUI
+Open WebUI has a full RAG pipeline (file upload → embed → knowledge collections →
+reference in chat). Could feed Nextcloud docs or session logs into a local knowledge
+base accessible to local models. Endpoints documented in `docs/OPEN_WEBUI_API.md`.
+- `/api/v1/files/` upload + `/api/v1/retrieval/process/web` for URLs
+- Reference in chat via `"files": [{"type": "collection", "id": "..."}]`
+
 ### [Backend] Intelligent model routing
- Currently hardcoded: Claude default, Gemini fallback
- Future: route by task type (code → Claude, search → Gemini, private → Ollama)
- Future: route by context length (Gemini 2.0 has 1M token context)
+- Currently hardcoded: Claude default, Gemini fallback, local third
+- Design direction (now informed by real local model perf):
+  - **Private/offline tasks** → local (Gemma 4 E4B for speed, 26B A4B for reasoning)
+  - **Complex tool tasks / long context** → Gemini (1M token context, strong function calling)
+  - **Final user-facing responses** → Claude (quality prose, persona fidelity)
+- Future: auto-route by task type rather than requiring user to toggle backend manually

 ---

 ## ✅ Completed

+### [Local] Per-user multi-model local LLM settings — 2026-04-01
+- `home/{username}/local_llm.json` — `hosts[]` + `models[]` + `active_model_id` structure
+- `cortex/user_settings.py` — CRUD functions: save_host, add_model, remove_model, set_active_model, get_active_local_model
+- `cortex/routers/local_llm.py` + `cortex/static/local_llm.html` — dedicated `/settings/local` page
+- "Fetch models from host" button — proxied via `/api/local-llm/fetch-models`, populates dropdown
+- Active model shown in UI near backend toggle button (amber hint text)
+- Migrates old flat `.env`-style config automatically on first use
+
+### [UI] Copy button for user (sent) messages — 2026-04-01
+- Added matching copy-on-hover button to user messages (same pattern as assistant messages)
+- `div.dataset.raw` set on send; `makeCopyBtn(div)` appended inline
+
+### [Backend] Local model backend (Open WebUI / Ollama) — 2026-04-01
+- OpenAI-compatible API via `httpx` — no CLI wrapper needed
+- Configured via `LOCAL_API_URL` / `LOCAL_API_KEY` / `LOCAL_MODEL` in `.env`
+- Backend toggle cycles `claude → gemini → local` (amber color in UI)
+- `/auth/status` includes local reachability check (`GET /api/models`)
+- Tested end-to-end: `test-agent-simple` (Qwen3-8B) on `scott-lt-i7-rtx:3000`, full persona context flowing correctly
+
+### [Testing] Gitea SSH port 2222 — 2026-03-29
+- pfSense WAN → 192.168.32.7:2222 port forward confirmed working
+- `ssh -p 2222 git@git.dgrzone.com` reaches Gitea (returns "Invalid repository path" — expected, confirms connectivity)
+- Clone/push via SSH: `git clone ssh://git@git.dgrzone.com:2222/<user>/<repo>.git`
+
 ### [Multi-user] Brian onboarding — 2026-03-29
 - Invite sent to `memedrift@gmail.com`
 - Brian completed onboarding, created `wintermute` persona