feat: local LLM multi-model, session search, cron proactive types, notifications, docs overhaul

Local LLM: - user_settings.py: per-user hosts/models config (local_llm.json) - routers/local_llm.py + static/local_llm.html: dedicated settings page - llm_client.py: local OpenAI-compatible backend via httpx - config.py: LOCAL_API_URL/KEY/MODEL + per-backend timeouts - Active model shown near backend toggle (amber hint text) Memory distillation: - memory_distiller.py: DISTILL_BACKEND_MID/LONG .env overrides - scheduler.py + notification.py: notify NC Talk after mid/long distill - notification.py: outbound channel abstraction (NC Talk, extensible) Session search: - routers/files.py: GET /sessions/search?q= with excerpts grouped by date - static/index.html + app.js: search UI in file sidebar with highlight - _esc() helper to prevent XSS in search results Proactive cron: - cron_runner.py: new job types — message (send directly) and brief (LLM + send) - Both support optional per-job channel override Channels: - routers/nextcloud_talk.py: consolidated using notification._send_nct_message() - routers/auth.py: local backend status in /auth/status - routers/chat.py: /backend returns {primary, fallback, local_model} object UI / UX: - Copy button for user messages (matching assistant) - Autocomplete disabled on sensitive form fields - settings.html: local model section replaced with link to /settings/local Docs overhaul: - MASTER.md hub + ARCH__SYSTEM/BACKENDS/PERSONA/CHANNELS/FUTURE.md - ARCH__Intelligence_Layer.md replaced with redirect table - CORTEX.md trimmed to vision only; README updated - OPEN_WEBUI_API.md added to docs/ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 20:53:06 -04:00
parent bd6532e93a
commit a4daebdc9b
33 changed files with 2985 additions and 486 deletions
--- a/documentation/TODO__Agents.md
+++ b/documentation/TODO__Agents.md
@@ -7,16 +7,21 @@

 ## 🔴 High Priority

-### [Backend] Ollama local model backend
- Add Ollama as a third LLM backend option (direct Ollama API, no CLI wrapper)
- Endpoint: `http://scott-gaming:<port>/api/` (WireGuard)
- Model selection: configurable per-request or per-session
- Auth status check: ping `/api/tags` to confirm reachability
+### [Local] Tool-capable local orchestrator
+Design and implement `local_orchestrator_engine.py` — a ReAct tool loop driven by
+a local model via Open WebUI's OpenAI-compatible API, as an alternative to the
+Gemini API orchestrator for private/offline tasks.

-### [Testing] Gitea SSH port 2222 ✅ — 2026-03-29
- pfSense WAN → 192.168.32.7:2222 port forward confirmed working
- `ssh -p 2222 git@git.dgrzone.com` reaches Gitea (returns "Invalid repository path" — expected, confirms connectivity)
- Clone/push via SSH: `git clone ssh://git@git.dgrzone.com:2222/<user>/<repo>.git`
+- [ ] Convert existing Cortex tool definitions (`cortex/tools/`) from Gemini
+      `FunctionDeclaration` format to OpenAI `tools` format (minor schema diff)
+- [ ] Implement tool loop: send tools → parse `tool_calls` response → execute →
+      append result → loop until `finish_reason: stop`
+- [ ] Wire into `routers/orchestrator.py` — new `mode` param: `"local"` vs `"gemini"`
+- [ ] UI: Agent mode button routes to local orchestrator when local backend active
+- [ ] Recommended models (scott_gaming, 8 GB VRAM):
+      Gemma 4 E4B — 25 t/s, 72k practical ctx — interactive/fast tasks
+      Gemma 4 26B A4B — 9 t/s, 50k practical ctx — heavier reasoning, background tasks
+- Reference: `docs/OPEN_WEBUI_API.md` for full tool call request/response format

 ---

@@ -30,15 +35,22 @@ See `ARCH__Intelligence_Layer.md` for full design.
 - [ ] Target: markdown files from `~/DgrZone_Nextcloud/` and `~/OSIT_Nextcloud/`
 - [ ] Tag strategy: source path, date, topic tags from frontmatter or filename

-### [Distill] Monitor first auto_distill_long run
- Scheduled for ~April 1 at 04:00
- Manually review `inara/MEMORY_LONG.md` output before fully trusting
- Adjust distill prompts if needed
+### [Distill] Review first auto_distill_long output — 2026-04-01
+- Ran April 1 at 04:00 as scheduled
+- Manually review `inara/MEMORY_LONG.md` — confirm quality before fully trusting
+- Adjust distill prompts in `cortex/memory_distiller.py` if needed

 ### [Distill] Distill quality review
 - Short/mid/long distill prompts live in `cortex/memory_distiller.py`
 - After first few automatic runs, review quality and tune

+### [Local] Unsloth Gemma 4 variants
+- Unsloth Dynamic 2.0 Q4_K_M GGUFs fail with `500: unable to load model` on Ollama v0.20.0
+- Root cause: Ollama's bundled llama.cpp doesn't recognize Gemma 4 GGUF architecture metadata from raw files
+- Waiting on Ollama point release (v0.20.1+) — then switch Open WebUI to Unsloth variants
+- Expected speedup: ~10–20% smaller context footprint vs baseline, same quality
+- `agent-support-gemma-small` → Unsloth E4B Q4_K_M; `agent-support-gemma-medium` → Unsloth 26B A4B Q4_K_M
+
 ---

 ## 🟢 Lower Priority / Future
@@ -61,15 +73,49 @@ See `ARCH__Intelligence_Layer.md`. Full design not yet started.
 - `cortex/routers/` already has pattern; add `gitea.py`
 - Gitea Actions (CI) for "run tests on push" — simpler than custom runner

+### [Local] RAG via Open WebUI
+Open WebUI has a full RAG pipeline (file upload → embed → knowledge collections →
+reference in chat). Could feed Nextcloud docs or session logs into a local knowledge
+base accessible to local models. Endpoints documented in `docs/OPEN_WEBUI_API.md`.
+- `/api/v1/files/` upload + `/api/v1/retrieval/process/web` for URLs
+- Reference in chat via `"files": [{"type": "collection", "id": "..."}]`
+
 ### [Backend] Intelligent model routing
- Currently hardcoded: Claude default, Gemini fallback
- Future: route by task type (code → Claude, search → Gemini, private → Ollama)
- Future: route by context length (Gemini 2.0 has 1M token context)
+- Currently hardcoded: Claude default, Gemini fallback, local third
+- Design direction (now informed by real local model perf):
+  - **Private/offline tasks** → local (Gemma 4 E4B for speed, 26B A4B for reasoning)
+  - **Complex tool tasks / long context** → Gemini (1M token context, strong function calling)
+  - **Final user-facing responses** → Claude (quality prose, persona fidelity)
+- Future: auto-route by task type rather than requiring user to toggle backend manually

 ---

 ## ✅ Completed

+### [Local] Per-user multi-model local LLM settings — 2026-04-01
+- `home/{username}/local_llm.json` — `hosts[]` + `models[]` + `active_model_id` structure
+- `cortex/user_settings.py` — CRUD functions: save_host, add_model, remove_model, set_active_model, get_active_local_model
+- `cortex/routers/local_llm.py` + `cortex/static/local_llm.html` — dedicated `/settings/local` page
+- "Fetch models from host" button — proxied via `/api/local-llm/fetch-models`, populates dropdown
+- Active model shown in UI near backend toggle button (amber hint text)
+- Migrates old flat `.env`-style config automatically on first use
+
+### [UI] Copy button for user (sent) messages — 2026-04-01
+- Added matching copy-on-hover button to user messages (same pattern as assistant messages)
+- `div.dataset.raw` set on send; `makeCopyBtn(div)` appended inline
+
+### [Backend] Local model backend (Open WebUI / Ollama) — 2026-04-01
+- OpenAI-compatible API via `httpx` — no CLI wrapper needed
+- Configured via `LOCAL_API_URL` / `LOCAL_API_KEY` / `LOCAL_MODEL` in `.env`
+- Backend toggle cycles `claude → gemini → local` (amber color in UI)
+- `/auth/status` includes local reachability check (`GET /api/models`)
+- Tested end-to-end: `test-agent-simple` (Qwen3-8B) on `scott-lt-i7-rtx:3000`, full persona context flowing correctly
+
+### [Testing] Gitea SSH port 2222 — 2026-03-29
+- pfSense WAN → 192.168.32.7:2222 port forward confirmed working
+- `ssh -p 2222 git@git.dgrzone.com` reaches Gitea (returns "Invalid repository path" — expected, confirms connectivity)
+- Clone/push via SSH: `git clone ssh://git@git.dgrzone.com:2222/<user>/<repo>.git`
+
 ### [Multi-user] Brian onboarding — 2026-03-29
 - Invite sent to `memedrift@gmail.com`
 - Brian completed onboarding, created `wintermute` persona