Local LLM:
- user_settings.py: per-user hosts/models config (local_llm.json)
- routers/local_llm.py + static/local_llm.html: dedicated settings page
- llm_client.py: local OpenAI-compatible backend via httpx
- config.py: LOCAL_API_URL/KEY/MODEL + per-backend timeouts
- Active model shown near backend toggle (amber hint text)
Memory distillation:
- memory_distiller.py: DISTILL_BACKEND_MID/LONG .env overrides
- scheduler.py + notification.py: notify NC Talk after mid/long distill
- notification.py: outbound channel abstraction (NC Talk, extensible)
Session search:
- routers/files.py: GET /sessions/search?q= with excerpts grouped by date
- static/index.html + app.js: search UI in file sidebar with highlight
- _esc() helper to prevent XSS in search results
Proactive cron:
- cron_runner.py: new job types — message (send directly) and brief (LLM + send)
- Both support optional per-job channel override
Channels:
- routers/nextcloud_talk.py: consolidated using notification._send_nct_message()
- routers/auth.py: local backend status in /auth/status
- routers/chat.py: /backend returns {primary, fallback, local_model} object
UI / UX:
- Copy button for user messages (matching assistant)
- Autocomplete disabled on sensitive form fields
- settings.html: local model section replaced with link to /settings/local
Docs overhaul:
- MASTER.md hub + ARCH__SYSTEM/BACKENDS/PERSONA/CHANNELS/FUTURE.md
- ARCH__Intelligence_Layer.md replaced with redirect table
- CORTEX.md trimmed to vision only; README updated
- OPEN_WEBUI_API.md added to docs/
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Open WebUI API Reference for Cortex
Last updated: 2026-04-03
Source: https://docs.openwebui.com/reference/api-endpoints/
Host in use: http://192.168.32.19:3000 (scott_gaming, 8 GB VRAM)
Local Model Performance (scott_gaming, 8 GB VRAM)
| Model | Alias | Speed | Practical Context | Spec Context |
|---|---|---|---|---|
| Gemma 4 E4B | agent-support-gemma-small | ~25 t/s | 72k tokens | 128k |
| Gemma 4 26B A4B (MoE) | agent-support-gemma-medium | ~9 t/s | 50k tokens | 256k |
Context is VRAM-constrained: the spec limits are higher, but the KV cache fills the available VRAM first. Techniques to extend it: quantizing the KV cache to a lower precision, enabling flash attention, and tuning the context length in Ollama.
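To make the VRAM constraint concrete, here is a rough sketch of the standard KV cache size estimate for a plain full-attention transformer. The layer, head, and dimension values are hypothetical placeholders, not Gemma 4's real architecture, and the 2 GiB cache budget is likewise an assumption for illustration.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Approximate KV cache size for a full-attention transformer."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def max_tokens_for_budget(budget_bytes, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """How many tokens of context fit in a given cache budget."""
    per_token = kv_cache_bytes(1, n_layers, n_kv_heads, head_dim, bytes_per_elem)
    return int(budget_bytes // per_token)


# Hypothetical dimensions, NOT taken from any Gemma 4 model card.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
GIB = 1024 ** 3

# 2 bytes per element for fp16, roughly 1 byte per element for 8-bit quantization.
fp16_tokens = max_tokens_for_budget(2 * GIB, LAYERS, KV_HEADS, HEAD_DIM, 2.0)
q8_tokens = max_tokens_for_budget(2 * GIB, LAYERS, KV_HEADS, HEAD_DIM, 1.0)
print(fp16_tokens, q8_tokens)
```

Under these assumptions, halving the bytes per element doubles the context that fits in the same budget, which is why KV cache quantization is listed first.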
Practical implications for the local orchestrator:
- System prompt + memory (T2) + tool results + history: budget ~40-50k for small, ~35-40k for medium
- Medium at 9 t/s is fine for background/async tasks; small at 25 t/s is responsive enough for interactive use
- Both budgets are well above what most tool loop iterations need (~2-5k tokens per round)
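The budget arithmetic above can be sketched as a small helper. The 10k fixed overhead for system prompt and memory is a hypothetical figure, and `rounds_within_budget` is an illustrative name rather than anything in Cortex.

```python
def rounds_within_budget(budget_tokens, fixed_tokens, tokens_per_round):
    """Number of tool loop rounds that fit once fixed prompt overhead is paid."""
    remaining = budget_tokens - fixed_tokens
    if remaining <= 0:
        return 0
    return remaining // tokens_per_round


# Low end of each budget from the list above, worst-case 5k tokens per round.
# The 10_000 fixed overhead (system prompt + memory) is an assumed value.
small_rounds = rounds_within_budget(40_000, 10_000, 5_000)
medium_rounds = rounds_within_budget(35_000, 10_000, 5_000)
print(small_rounds, medium_rounds)
```

Even at the pessimistic end of each range, several rounds fit before history needs trimming.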
Authentication
All API calls use a bearer token:
Authorization: Bearer sk-<api-key>
API keys are managed in Open WebUI → Settings → Account → API Keys.
Cortex stores these per-user in home/{username}/local_llm.json → hosts[].api_key.
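A minimal sketch of turning the stored key into a request header. Only the hosts[].api_key path is documented here; the sample file contents and the helper names are assumptions.

```python
import json


def bearer_headers(api_key):
    """Build the Authorization header Open WebUI expects."""
    return {"Authorization": f"Bearer {api_key}"}


def headers_for_host(config_text, index=0):
    """Read hosts[index].api_key from local_llm.json text and build headers."""
    config = json.loads(config_text)
    return bearer_headers(config["hosts"][index]["api_key"])


# Hypothetical file contents; real files may carry more fields per host.
sample_config = '{"hosts": [{"api_key": "sk-example"}]}'
headers = headers_for_host(sample_config)
print(headers)
```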
Core Endpoints Used by Cortex
List Available Models
GET /api/models
Authorization: Bearer sk-...
Returns all models (Ollama, OpenAI-proxied, custom functions).
Used by /api/local-llm/fetch-models in routers/local_llm.py.
Response shape:
{
"data": [
{ "id": "gemma4-e4b", "name": "Gemma 4 E4B" },
...
]
}
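Given the response shape above, extracting the model IDs could look like the following sketch. `model_ids` is an illustrative helper, not the actual code in routers/local_llm.py.

```python
import json


def model_ids(response_text):
    """Return the list of model IDs from a /api/models response body."""
    payload = json.loads(response_text)
    # Skip malformed entries rather than failing the whole listing.
    return [m["id"] for m in payload.get("data", []) if "id" in m]


sample_response = '{"data": [{"id": "gemma4-e4b", "name": "Gemma 4 E4B"}]}'
ids = model_ids(sample_response)
print(ids)
```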
Chat Completions (OpenAI-compatible)
POST /api/chat/completions
Authorization: Bearer sk-...
Content-Type: application/json
Standard OpenAI chat format. Supports:
- messages: standard role/content array
- model: model ID or workspace alias
- tools + tool_choice: function calling (see Tool / Function Calling below)
- stream: true/false
This is the endpoint used by _local() in llm_client.py.
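As an illustration of the request shape, the sketch below builds (but does not send) a chat completion request. It uses the standard library's urllib to stay dependency-free, whereas _local() uses httpx; `build_chat_request` and the sample key are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, messages, stream=False):
    """Prepare a POST to /api/chat/completions without sending it."""
    body = {"model": model, "messages": messages, "stream": stream}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    "http://192.168.32.19:3000",
    "sk-example",  # placeholder key
    "agent-support-gemma-small",
    [{"role": "user", "content": "Hello"}],
)
print(request.full_url)
```

Sending it would be a call to urllib.request.urlopen(request, timeout=...), with the timeout chosen per backend.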
Anthropic Messages API Compatibility
POST /api/v1/messages
Authorization: Bearer sk-...
Open WebUI also accepts Anthropic-format requests and auto-converts them.
Could be used to route Claude SDK calls through Open WebUI.
Base URL for this mode: http://192.168.32.19:3000/api
Direct Ollama Proxy
GET /ollama/api/tags — list models
POST /ollama/api/generate — streaming completions
POST /ollama/api/embed — generate embeddings
Use these if you need to bypass Open WebUI's filter layer and hit Ollama directly.
Ollama is also accessible directly at http://192.168.32.19:11434.
Tool / Function Calling
Both Gemma 4 models (E4B and 26B A4B) support function calling via the standard
OpenAI tools parameter. Open WebUI passes this through to the underlying model.
Request Format
POST /api/chat/completions
{
"model": "gemma4-26b-a4b",
"messages": [
{ "role": "system", "content": "..." },
{ "role": "user", "content": "What's the weather?" }
],
"tools": [
{
"type": "function",
"function": {
"name": "web_search",
"description": "Search the web for current information",
"parameters": {
"type": "object",
"properties": {
"query": { "type": "string", "description": "Search query" }
},
"required": ["query"]
}
}
}
],
"tool_choice": "auto"
}
Tool Call Response
When the model wants to call a tool, it returns finish_reason: "tool_calls":
{
"choices": [{
"finish_reason": "tool_calls",
"message": {
"role": "assistant",
"content": null,
"tool_calls": [{
"id": "call_abc123",
"type": "function",
"function": {
"name": "web_search",
"arguments": "{\"query\": \"current weather NYC\"}"
}
}]
}
}]
}
Sending Tool Results Back
Append the assistant's tool_call message and a tool result message, then re-submit:
{
"messages": [
{ "role": "user", "content": "What's the weather?" },
{ "role": "assistant", "content": null,
"tool_calls": [{ "id": "call_abc123", "type": "function", "function": { "name": "web_search", "arguments": "..." } }] },
{ "role": "tool", "tool_call_id": "call_abc123",
"content": "Current weather in NYC: 62°F, partly cloudy." }
],
"tools": [...],
"tool_choice": "auto"
}
Repeat until finish_reason: "stop".
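The request, tool call, and result steps above can be sketched as one loop. This is a minimal sketch, not Cortex's implementation: `send` is a hypothetical callable that POSTs the payload to /api/chat/completions and returns the parsed JSON, and `handlers` and `max_rounds` are illustrative names.

```python
import json


def run_tool_loop(send, model, messages, tools, handlers, max_rounds=8):
    """Drive the tool loop until the model stops asking for tools."""
    messages = list(messages)  # do not mutate the caller's list
    for _ in range(max_rounds):
        response = send({
            "model": model,
            "messages": messages,
            "tools": tools,
            "tool_choice": "auto",
        })
        choice = response["choices"][0]
        message = choice["message"]
        if choice.get("finish_reason") != "tool_calls":
            return message.get("content") or ""
        # Echo the assistant's tool_call message, then one result per call.
        messages.append(message)
        for call in message.get("tool_calls", []):
            function = call["function"]
            arguments = json.loads(function["arguments"])
            result = handlers[function["name"]](**arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": str(result),
            })
    raise RuntimeError("tool loop did not finish within max_rounds")
```

Passing `send` in as a parameter keeps the loop independent of the HTTP client, so it can be exercised with canned responses before pointing it at a real host.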
RAG (Retrieval Augmented Generation)
Upload a File
POST /api/v1/files/
Authorization: Bearer sk-...
Content-Type: multipart/form-data
file=@/path/to/document.pdf
Returns a file ID. Poll /api/v1/files/{id}/process/status until completed.
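The polling step might be sketched as below. The response shape of the status endpoint is not documented here, so the `status` field and the "completed" value are assumptions; `fetch_status` is a hypothetical callable that performs the GET and returns the parsed JSON.

```python
import time


def wait_until_processed(fetch_status, file_id, interval=1.0,
                         max_attempts=30, sleep=time.sleep):
    """Poll until the file reports completion; return False on timeout."""
    for _ in range(max_attempts):
        # Assumed shape: {"status": "..."}; adjust to the real response.
        status = fetch_status(file_id).get("status")
        if status == "completed":
            return True
        sleep(interval)
    return False
```

The `sleep` parameter is injectable so the loop can be tested without real delays.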
Knowledge Collections
POST /api/v1/knowledge/{collection_id}/file/add
{ "file_id": "..." }
Use in Chat
Reference files or knowledge collections in any chat request:
{
"model": "gemma4-26b-a4b",
"messages": [...],
"files": [
{ "type": "file", "id": "file-id" },
{ "type": "collection", "id": "collection-id" }
]
}
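A small sketch of assembling that payload from lists of IDs. The helper name and its parameters are illustrative only.

```python
def chat_payload_with_sources(model, messages, file_ids=(), collection_ids=()):
    """Build a chat payload that references uploaded files and collections."""
    files = [{"type": "file", "id": file_id} for file_id in file_ids]
    files += [{"type": "collection", "id": cid} for cid in collection_ids]
    payload = {"model": model, "messages": messages}
    if files:
        # Omit the key entirely when there is nothing to attach.
        payload["files"] = files
    return payload


payload = chat_payload_with_sources(
    "gemma4-26b-a4b",
    [{"role": "user", "content": "Summarize the document."}],
    file_ids=["file-id"],
    collection_ids=["collection-id"],
)
print(payload["files"])
```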
Process a Web URL into a Collection
POST /api/v1/retrieval/process/web
{ "url": "https://example.com/article", "collection_id": "..." }
Filter Behavior with Direct API Calls
Open WebUI supports inlet/outlet filter pipelines. With direct API access:
| Filter | Runs automatically? |
|---|---|
| inlet() | ✅ Yes |
| stream() | ✅ Yes |
| outlet() | ❌ Manual only: call POST /api/chat/completed after receiving response |
For Cortex's use case (tool loop orchestration), this is not a concern — we're driving the loop ourselves and don't rely on Open WebUI's filter pipeline.
Relevant Cortex Files
| File | Purpose |
|---|---|
| cortex/llm_client.py: _local() | Current local backend (direct chat only) |
| cortex/routers/local_llm.py | Local model settings page + fetch-models endpoint |
| cortex/user_settings.py | Per-user host + model config (local_llm.json) |
| cortex/orchestrator_engine.py | Gemini API tool loop, reference for local version |
| home/{user}/local_llm.json | Stored host/model config |
Planned: Local Orchestrator (local_orchestrator_engine.py)
A local equivalent of orchestrator_engine.py that:
- Takes the same tool definitions already registered in cortex/tools/
- Converts them to OpenAI tools format (already close, with a minor schema difference from Gemini)
- Runs a ReAct loop against the local model via /api/chat/completions
- Falls back gracefully if the model doesn't return a valid tool call
See documentation/TODO__Agents.md — [Local] Tool-capable local orchestrator.
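The conversion step might look like the sketch below. It assumes the tool definitions are Gemini-style function declarations (name, description, parameters) with uppercase type names such as "OBJECT" and "STRING"; the actual format in cortex/tools/ may differ, and both helper names are hypothetical.

```python
def _lower_types(schema):
    """Recursively lowercase JSON-schema "type" values (OBJECT -> object)."""
    if isinstance(schema, dict):
        return {
            key: value.lower() if key == "type" and isinstance(value, str)
            else _lower_types(value)
            for key, value in schema.items()
        }
    if isinstance(schema, list):
        return [_lower_types(item) for item in schema]
    return schema


def gemini_to_openai_tool(declaration):
    """Wrap a Gemini-style declaration in the OpenAI tools envelope."""
    empty_params = {"type": "object", "properties": {}}
    return {
        "type": "function",
        "function": {
            "name": declaration["name"],
            "description": declaration.get("description", ""),
            "parameters": _lower_types(declaration.get("parameters", empty_params)),
        },
    }


# Assumed input shape for illustration.
gemini_declaration = {
    "name": "web_search",
    "description": "Search the web for current information",
    "parameters": {
        "type": "OBJECT",
        "properties": {"query": {"type": "STRING", "description": "Search query"}},
        "required": ["query"],
    },
}
openai_tool = gemini_to_openai_tool(gemini_declaration)
print(openai_tool["function"]["parameters"]["type"])
```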
Model recommendation:
- Gemma 4 26B A4B (256k ctx, MoE — fast for its size) for complex tool tasks
- Gemma 4 E4B (128k ctx) for lightweight/fast tasks
Notes
- Open WebUI workspace aliases (e.g. agent-support-gemma-small) resolve to the underlying Ollama model; use aliases in Cortex for human-friendly model names.
- tool_choice: "auto" lets the model decide; "none" forces a plain text response; {"type": "function", "function": {"name": "..."}} forces a specific tool.
- Gemma 4 models support combined tool use + reasoning (thinking tokens), which is useful for complex multi-step tasks.
- For embeddings (future RAG work), use /ollama/api/embed directly.
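For the embeddings note, a hedged sketch of the request and response handling follows. It assumes Ollama's embed format (a request with model and input, a response with an embeddings list of vectors) passes through the Open WebUI proxy unchanged; the model name and helper names are placeholders.

```python
import json
import urllib.request


def build_embed_request(base_url, api_key, model, texts):
    """Prepare a POST to /ollama/api/embed without sending it."""
    body = {"model": model, "input": texts}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/ollama/api/embed",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_embeddings(response_text):
    """Return one vector per input text, or an empty list if absent."""
    return json.loads(response_text).get("embeddings", [])


embed_request = build_embed_request(
    "http://192.168.32.19:3000",
    "sk-example",            # placeholder key
    "embedding-model-name",  # placeholder; no embedding model is named here
    ["first passage", "second passage"],
)
vectors = parse_embeddings('{"embeddings": [[0.1, 0.2], [0.3, 0.4]]}')
print(len(vectors))
```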