Cortex-Inara/docs/OPEN_WEBUI_API.md
Scott Idem a4daebdc9b feat: local LLM multi-model, session search, cron proactive types, notifications, docs overhaul
2026-04-05 20:53:06 -04:00

Open WebUI API Reference for Cortex

Last updated: 2026-04-03
Source: https://docs.openwebui.com/reference/api-endpoints/
Host in use: http://192.168.32.19:3000 (scott_gaming, 8 GB VRAM)

Local Model Performance (scott_gaming, 8 GB VRAM)

| Model | Alias | Speed | Practical Context | Spec Context |
| --- | --- | --- | --- | --- |
| Gemma 4 E4B | agent-support-gemma-small | ~25 t/s | 72k tokens | 128k |
| Gemma 4 26B A4B (MoE) | agent-support-gemma-medium | ~9 t/s | 50k tokens | 256k |

Context is VRAM-constrained: the spec limits are higher, but the KV cache fills the available VRAM first. Options for extending the usable context: quantize the KV cache to a lower precision, enable flash attention, and tune the context length in Ollama.

Practical implications for the local orchestrator:

  • System prompt + memory (T2) + tool results + history: budget ~40-50k tokens for small, ~35-40k tokens for medium
  • Medium at ~9 t/s is fine for background/async tasks; small at ~25 t/s is responsive enough for interactive use
  • Both budgets are well above what most tool loop iterations need (~2-5k tokens per round)
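
The budget figures above reduce to simple arithmetic. A minimal sketch, assuming the token count of each prompt component is already known; the helper name and the sample counts are illustrative, not part of Cortex:

```python
# Upper end of the budget ranges listed above, in tokens.
BUDGET_TOKENS = {"small": 50_000, "medium": 40_000}


def budget_headroom(model: str, components: dict) -> int:
    """Return the tokens left in the budget (negative means over budget).

    `components` maps a label (system prompt, memory, ...) to its token
    count. Hypothetical helper for illustration only.
    """
    return BUDGET_TOKENS[model] - sum(components.values())


# Illustrative numbers, not measurements.
example = {
    "system_prompt": 3_000,
    "memory_t2": 12_000,
    "tool_results": 10_000,
    "history": 20_000,
}
small_headroom = budget_headroom("small", example)
medium_headroom = budget_headroom("medium", example)
```

With these sample numbers the same working set fits the small model's budget but overshoots the medium model's, which is the case where history would need trimming first.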

Authentication

All API calls use a bearer token:

Authorization: Bearer sk-<api-key>

API keys are managed in Open WebUI → Settings → Account → API Keys. Cortex stores these per-user in home/{username}/local_llm.json → hosts[].api_key.
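
As a rough illustration of how the stored key becomes a request header: the sketch below assumes local_llm.json has a top-level hosts list whose entries carry api_key, as described above. The url field in the sample and both function names are assumptions, not the actual Cortex schema or code.

```python
import json


def load_api_key(config_text: str, host_index: int = 0) -> str:
    """Read hosts[host_index].api_key from the text of a local_llm.json file."""
    config = json.loads(config_text)
    return config["hosts"][host_index]["api_key"]


def auth_headers(api_key: str) -> dict:
    """Build the bearer-token header that every Open WebUI API call needs."""
    return {"Authorization": f"Bearer {api_key}"}


# Sample config; the "url" field is a guess at the surrounding structure.
sample_config = json.dumps(
    {"hosts": [{"url": "http://192.168.32.19:3000", "api_key": "sk-example"}]}
)
headers = auth_headers(load_api_key(sample_config))
```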


Core Endpoints Used by Cortex

List Available Models

GET /api/models
Authorization: Bearer sk-...

Returns all models (Ollama, OpenAI-proxied, custom functions). Used by /api/local-llm/fetch-models in routers/local_llm.py.

Response shape:

{
  "data": [
    { "id": "gemma4-e4b", "name": "Gemma 4 E4B" },
    ...
  ]
}
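
A small sketch of reading that response shape, for example to populate a model picker. The function name is hypothetical and this is not the code in routers/local_llm.py:

```python
def model_choices(models_response: dict) -> list:
    """Return (id, display name) pairs from a GET /api/models response.

    Falls back to the id when an entry has no "name" field.
    """
    return [
        (entry["id"], entry.get("name", entry["id"]))
        for entry in models_response.get("data", [])
    ]


sample_response = {"data": [{"id": "gemma4-e4b", "name": "Gemma 4 E4B"}]}
choices = model_choices(sample_response)
```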

Chat Completions (OpenAI-compatible)

POST /api/chat/completions
Authorization: Bearer sk-...
Content-Type: application/json

Standard OpenAI chat format. Supports:

  • messages — standard role/content array
  • model — model ID or workspace alias
  • tools + tool_choice — function calling (see Tool Loop below)
  • stream: true/false

This is the endpoint used by _local() in llm_client.py.
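
A minimal sketch of such a call. It uses only the standard library so that it stands alone, whereas llm_client.py uses httpx; the function names and the timeout value are assumptions.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, messages, stream=False):
    """Build (but do not send) a POST to /api/chat/completions."""
    payload = {"model": model, "messages": messages, "stream": stream}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat(request, timeout=120.0):
    """Send a non-streaming request and return the assistant's text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Building the request separately from sending it keeps the payload easy to inspect and test without a running Open WebUI host.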

Anthropic Messages API Compatibility

POST /api/v1/messages
Authorization: Bearer sk-...

Open WebUI also accepts Anthropic-format requests and converts them automatically. This could be used to route Claude SDK calls through Open WebUI. Base URL for this mode: http://192.168.32.19:3000/api

Direct Ollama Proxy

GET  /ollama/api/tags        — list models
POST /ollama/api/generate    — streaming completions
POST /ollama/api/embed       — generate embeddings

Use these to reach Ollama's native API through Open WebUI while bypassing Open WebUI's filter layer. Ollama is also reachable without the proxy at http://192.168.32.19:11434.


Tool / Function Calling

Both Gemma 4 models (E4B and 26B A4B) support function calling via the standard OpenAI tools parameter. Open WebUI passes this through to the underlying model.

Request Format

POST /api/chat/completions
{
  "model": "gemma4-26b-a4b",
  "messages": [
    { "role": "system", "content": "..." },
    { "role": "user",   "content": "What's the weather?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "web_search",
        "description": "Search the web for current information",
        "parameters": {
          "type": "object",
          "properties": {
            "query": { "type": "string", "description": "Search query" }
          },
          "required": ["query"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}

Tool Call Response

When the model wants to call a tool, it returns finish_reason: "tool_calls":

{
  "choices": [{
    "finish_reason": "tool_calls",
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "web_search",
          "arguments": "{\"query\": \"current weather NYC\"}"
        }
      }]
    }
  }]
}
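
Note that function.arguments arrives as a JSON-encoded string, not an object, so it has to be decoded before dispatch. A minimal parsing sketch with a hypothetical function name:

```python
import json


def extract_tool_calls(response: dict) -> list:
    """Return decoded tool calls, or an empty list for a plain-text reply."""
    choice = response["choices"][0]
    if choice.get("finish_reason") != "tool_calls":
        return []
    calls = []
    for call in choice["message"].get("tool_calls") or []:
        function = call["function"]
        calls.append(
            {
                "id": call["id"],
                "name": function["name"],
                # Arguments are a JSON string in the OpenAI format.
                "arguments": json.loads(function["arguments"]),
            }
        )
    return calls
```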

Sending Tool Results Back

Append the assistant's tool_calls message and a matching tool result message, then re-submit:

{
  "messages": [
    { "role": "user",      "content": "What's the weather?" },
    { "role": "assistant", "content": null,
      "tool_calls": [{ "id": "call_abc123", "type": "function", "function": { "name": "web_search", "arguments": "..." } }] },
    { "role": "tool",      "tool_call_id": "call_abc123",
      "content": "Current weather in NYC: 62°F, partly cloudy." }
  ],
  "tools": [...],
  "tool_choice": "auto"
}

Repeat until the model returns finish_reason: "stop".
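
The request/response cycle above can be sketched as a loop. This is an illustration under assumptions, not the planned orchestrator: send stands in for whatever function posts the messages to /api/chat/completions and returns the parsed response, handlers maps tool names to plain Python callables, and max_rounds is an arbitrary safety cap.

```python
import json


def run_tool_loop(send, handlers, messages, max_rounds=8):
    """Drive the tool loop until the model stops asking for tools."""
    messages = list(messages)  # do not mutate the caller's history
    for _ in range(max_rounds):
        choice = send(messages)["choices"][0]
        message = choice["message"]
        if choice.get("finish_reason") != "tool_calls":
            return message.get("content") or ""
        # Echo the assistant's tool_calls message back, then add one
        # tool result message per requested call.
        messages.append(message)
        for call in message["tool_calls"]:
            function = call["function"]
            result = handlers[function["name"]](**json.loads(function["arguments"]))
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": str(result)}
            )
    raise RuntimeError("tool loop did not finish within max_rounds")
```

Passing send in as a parameter keeps the loop independent of the HTTP client and lets it run against canned responses.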


RAG (Retrieval Augmented Generation)

Upload a File

POST /api/v1/files/
Authorization: Bearer sk-...
Content-Type: multipart/form-data

file=@/path/to/document.pdf

Returns a file ID. Poll /api/v1/files/{id}/process/status until the status is completed.
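
A polling sketch under stated assumptions: fetch_status is a caller-supplied function that performs the GET and returns the status as a string, and the "failed" value, interval, and attempt cap are guesses rather than documented behavior.

```python
import time


def wait_until_processed(fetch_status, file_id, interval=1.0,
                         max_attempts=30, sleep=time.sleep):
    """Poll until processing completes; return False if attempts run out."""
    for _ in range(max_attempts):
        status = fetch_status(file_id)
        if status == "completed":
            return True
        if status == "failed":  # assumed failure value
            raise RuntimeError(f"processing failed for file {file_id}")
        sleep(interval)
    return False
```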

Knowledge Collections

POST /api/v1/knowledge/{collection_id}/file/add
{ "file_id": "..." }

Use in Chat

Reference files or knowledge collections in any chat request:

{
  "model": "gemma4-26b-a4b",
  "messages": [...],
  "files": [
    { "type": "file",       "id": "file-id" },
    { "type": "collection", "id": "collection-id" }
  ]
}
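
A small helper sketch for attaching these references to an existing chat payload; the function name is hypothetical.

```python
def with_rag_sources(payload: dict, file_ids=(), collection_ids=()) -> dict:
    """Return a copy of the chat payload with a "files" list attached.

    The payload is returned unchanged (as a copy) when no sources are given.
    """
    sources = [{"type": "file", "id": file_id} for file_id in file_ids]
    sources += [{"type": "collection", "id": cid} for cid in collection_ids]
    result = dict(payload)
    if sources:
        result["files"] = sources
    return result
```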

Process a Web URL into a Collection

POST /api/v1/retrieval/process/web
{ "url": "https://example.com/article", "collection_id": "..." }

Filter Behavior with Direct API Calls

Open WebUI supports inlet/outlet filter pipelines. With direct API access:

| Filter | Runs automatically? |
| --- | --- |
| inlet() | Yes |
| stream() | Yes |
| outlet() | Manual only: call POST /api/chat/completed after receiving the response |

For Cortex's use case (tool loop orchestration), this is not a concern: Cortex drives the loop itself and does not rely on Open WebUI's filter pipeline.


Relevant Cortex Files

| File | Purpose |
| --- | --- |
| cortex/llm_client.py → _local() | Current local backend (direct chat only) |
| cortex/routers/local_llm.py | Local model settings page + fetch-models endpoint |
| cortex/user_settings.py | Per-user host + model config (local_llm.json) |
| cortex/orchestrator_engine.py | Gemini API tool loop, the reference for the local version |
| home/{user}/local_llm.json | Stored host/model config |

Planned: Local Orchestrator (local_orchestrator_engine.py)

A local equivalent of orchestrator_engine.py that:

  1. Takes the same tool definitions already registered in cortex/tools/
  2. Converts them to OpenAI tools format (already close — minor schema diff from Gemini)
  3. Runs a ReAct loop against the local model via /api/chat/completions
  4. Falls back gracefully if the model doesn't return a valid tool call
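
Step 2 might look like the following. This is a sketch under assumptions: it supposes the existing tool definitions are plain dicts in Gemini's function-declaration shape (name, description, parameters) with upper-case type names such as OBJECT and STRING. The real registry format in cortex/tools/ may differ, and to_openai_tool is a hypothetical name.

```python
def _lowercase_types(schema):
    """Recursively lower-case string values stored under "type" keys."""
    if isinstance(schema, dict):
        return {
            key: value.lower() if key == "type" and isinstance(value, str)
            else _lowercase_types(value)
            for key, value in schema.items()
        }
    if isinstance(schema, list):
        return [_lowercase_types(item) for item in schema]
    return schema


def to_openai_tool(declaration: dict) -> dict:
    """Wrap one assumed Gemini-style declaration in the OpenAI tools format."""
    parameters = declaration.get("parameters", {"type": "object", "properties": {}})
    return {
        "type": "function",
        "function": {
            "name": declaration["name"],
            "description": declaration.get("description", ""),
            "parameters": _lowercase_types(parameters),
        },
    }
```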

See documentation/TODO__Agents.md → [Local] Tool-capable local orchestrator.

Model recommendation:

  • Gemma 4 26B A4B (256k ctx, MoE — fast for its size) for complex tool tasks
  • Gemma 4 E4B (128k ctx) for lightweight/fast tasks

Notes

  • Open WebUI workspace aliases (e.g. agent-support-gemma-small) resolve to the underlying Ollama model — use aliases in Cortex for human-friendly model names.
  • tool_choice: "auto" lets the model decide; "none" forces plain text response; {"type": "function", "function": {"name": "..."}} forces a specific tool.
  • Gemma 4 models support combined tool use + reasoning (thinking tokens) — useful for complex multi-step tasks.
  • For embeddings (future RAG work), use /ollama/api/embed directly.
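
The three tool_choice forms listed above can be produced by one small helper. A sketch with a hypothetical function name:

```python
def make_tool_choice(mode: str = "auto", function_name: str = None):
    """Return a tool_choice value for a chat completions request.

    Passing function_name forces that specific tool; otherwise mode must
    be "auto" (model decides) or "none" (plain text only).
    """
    if function_name is not None:
        return {"type": "function", "function": {"name": function_name}}
    if mode not in ("auto", "none"):
        raise ValueError(f"unsupported tool_choice mode: {mode!r}")
    return mode
```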