Cortex-Inara/docs/OPEN_WEBUI_API.md
Scott Idem a4daebdc9b feat: local LLM multi-model, session search, cron proactive types, notifications, docs overhaul
Local LLM:
- user_settings.py: per-user hosts/models config (local_llm.json)
- routers/local_llm.py + static/local_llm.html: dedicated settings page
- llm_client.py: local OpenAI-compatible backend via httpx
- config.py: LOCAL_API_URL/KEY/MODEL + per-backend timeouts
- Active model shown near backend toggle (amber hint text)

Memory distillation:
- memory_distiller.py: DISTILL_BACKEND_MID/LONG .env overrides
- scheduler.py + notification.py: notify NC Talk after mid/long distill
- notification.py: outbound channel abstraction (NC Talk, extensible)

Session search:
- routers/files.py: GET /sessions/search?q= with excerpts grouped by date
- static/index.html + app.js: search UI in file sidebar with highlight
- _esc() helper to prevent XSS in search results

Proactive cron:
- cron_runner.py: new job types — message (send directly) and brief (LLM + send)
- Both support optional per-job channel override

Channels:
- routers/nextcloud_talk.py: consolidated using notification._send_nct_message()
- routers/auth.py: local backend status in /auth/status
- routers/chat.py: /backend returns {primary, fallback, local_model} object

UI / UX:
- Copy button for user messages (matching assistant)
- Autocomplete disabled on sensitive form fields
- settings.html: local model section replaced with link to /settings/local

Docs overhaul:
- MASTER.md hub + ARCH__SYSTEM/BACKENDS/PERSONA/CHANNELS/FUTURE.md
- ARCH__Intelligence_Layer.md replaced with redirect table
- CORTEX.md trimmed to vision only; README updated
- OPEN_WEBUI_API.md added to docs/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 20:53:06 -04:00


# Open WebUI API Reference for Cortex
> Last updated: 2026-04-03
> Source: https://docs.openwebui.com/reference/api-endpoints/
> Host in use: `http://192.168.32.19:3000` (scott_gaming — 8 GB VRAM)
## Local Model Performance (scott_gaming, 8 GB VRAM)
| Model | Alias | Speed | Practical Context | Spec Context |
|---|---|---|---|---|
| Gemma 4 E4B | `agent-support-gemma-small` | ~25 t/s | **72k tokens** | 128k |
| Gemma 4 26B A4B (MoE) | `agent-support-gemma-medium` | ~9 t/s | **50k tokens** | 256k |

Practical context is VRAM-constrained: the spec limits are higher, but the KV cache fills the available VRAM first.
Ways to extend it: quantize the KV cache to lower precision, enable flash attention, and tune the context length in Ollama.
**Practical implications for the local orchestrator:**
- System prompt + memory (T2) + tool results + history: budget ~40-50k for small, ~35-40k for medium
- Medium at 9 t/s is fine for background/async tasks; small at 25 t/s is responsive enough for interactive use
- Both are well above what's needed for most tool loop iterations (~2-5k tokens per round)
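
The budgeting above can be sketched as simple arithmetic. This is only an illustration: `history_budget`, its parameters, and the default reserve values are hypothetical, and the context sizes are the practical figures from the table, not measured limits.

```python
# Practical context windows from the table above, in tokens.
PRACTICAL_CONTEXT = {
    "agent-support-gemma-small": 72_000,
    "agent-support-gemma-medium": 50_000,
}


def history_budget(alias: str, system_tokens: int, memory_tokens: int,
                   tool_round_tokens: int = 5_000, rounds: int = 4,
                   reply_reserve: int = 2_000) -> int:
    """Return the tokens left for chat history after fixed costs.

    Hypothetical helper: reserves room for the system prompt, T2 memory,
    a number of tool-loop rounds, and the model's reply.
    """
    fixed = system_tokens + memory_tokens + tool_round_tokens * rounds + reply_reserve
    return max(0, PRACTICAL_CONTEXT[alias] - fixed)
```

With a 3k system prompt and 8k of memory, four tool rounds at the upper 5k estimate leave 39k tokens of history on the small model and 17k on the medium one.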
---
## Authentication
All API calls use a bearer token:
```
Authorization: Bearer sk-<api-key>
```
API keys are managed in Open WebUI → Settings → Account → API Keys.
Cortex stores these per-user in `home/{username}/local_llm.json` → `hosts[].api_key`.
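
As a rough sketch, the header can be built from the stored config like this. Only the `hosts[].api_key` path comes from this document; the `url` field in the sample and the `auth_headers` helper are assumptions for illustration, not the actual `user_settings.py` code.

```python
import json


def auth_headers(config_text: str, host_index: int = 0) -> dict:
    """Build the bearer header from the text of a local_llm.json file.

    Assumes the layout {"hosts": [{"api_key": "sk-..."}]}.
    """
    config = json.loads(config_text)
    api_key = config["hosts"][host_index]["api_key"]
    return {"Authorization": f"Bearer {api_key}"}


# Hypothetical file contents; the "url" field is an assumption.
SAMPLE_CONFIG = '{"hosts": [{"url": "http://192.168.32.19:3000", "api_key": "sk-example"}]}'
```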
---
## Core Endpoints Used by Cortex
### List Available Models
```
GET /api/models
Authorization: Bearer sk-...
```
Returns all models (Ollama, OpenAI-proxied, custom functions).
Used by `/api/local-llm/fetch-models` in `routers/local_llm.py`.
Response shape:
```json
{
  "data": [
    { "id": "gemma4-e4b", "name": "Gemma 4 E4B" },
    ...
  ]
}
```
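
A minimal sketch of turning that response into choices for a model picker, assuming only the `data[].id` and `data[].name` fields shown above. `model_choices` is a hypothetical name, not the function used in `routers/local_llm.py`.

```python
def model_choices(payload: dict) -> list[tuple[str, str]]:
    """Return (id, display name) pairs from a /api/models response.

    Falls back to the id when an entry has no "name" field.
    """
    return [(m["id"], m.get("name", m["id"])) for m in payload.get("data", [])]
```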
### Chat Completions (OpenAI-compatible)
```
POST /api/chat/completions
Authorization: Bearer sk-...
Content-Type: application/json
```
Standard OpenAI chat format. Supports:
- `messages` — standard role/content array
- `model` — model ID or workspace alias
- `tools` + `tool_choice` — function calling (see Tool Loop below)
- `stream: true/false`
This is the endpoint used by `_local()` in `llm_client.py`.
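
For illustration, a request to this endpoint could be assembled as below. Cortex itself uses `httpx`; this sketch uses the standard library so it stays self-contained, and `build_chat_request` is a hypothetical helper rather than the code in `_local()`.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str,
                       messages: list, stream: bool = False) -> urllib.request.Request:
    """Prepare (but do not send) a POST to /api/chat/completions."""
    body = json.dumps({"model": model, "messages": messages, "stream": stream})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/chat/completions",
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it is then a matter of `urllib.request.urlopen(request, timeout=...)` and decoding the JSON body; the timeout should be generous for the medium model at ~9 t/s.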
### Anthropic Messages API Compatibility
```
POST /api/v1/messages
Authorization: Bearer sk-...
```
Open WebUI also accepts Anthropic-format requests and auto-converts them.
Could be used to route Claude SDK calls through Open WebUI.
Base URL for this mode: `http://192.168.32.19:3000/api`
### Direct Ollama Proxy
```
GET /ollama/api/tags — list models
POST /ollama/api/generate — streaming completions
POST /ollama/api/embed — generate embeddings
```
Use these if you need to bypass Open WebUI's filter layer and hit Ollama directly.
Ollama is also accessible directly at `http://192.168.32.19:11434`.
---
## Tool / Function Calling
Both Gemma 4 models (E4B and 26B A4B) support function calling via the standard
OpenAI `tools` parameter. Open WebUI passes this through to the underlying model.
### Request Format
Send as the body of `POST /api/chat/completions`:
```json
{
  "model": "gemma4-26b-a4b",
  "messages": [
    { "role": "system", "content": "..." },
    { "role": "user", "content": "What's the weather?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "web_search",
        "description": "Search the web for current information",
        "parameters": {
          "type": "object",
          "properties": {
            "query": { "type": "string", "description": "Search query" }
          },
          "required": ["query"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
```
### Tool Call Response
When the model wants to call a tool, it returns `finish_reason: "tool_calls"`:
```json
{
  "choices": [{
    "finish_reason": "tool_calls",
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "web_search",
          "arguments": "{\"query\": \"current weather NYC\"}"
        }
      }]
    }
  }]
}
```
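
Note that `arguments` arrives as a JSON-encoded string, not an object, so it needs a second decode. A minimal sketch of reading a response of the shape above (`extract_tool_calls` is a hypothetical helper name):

```python
import json


def extract_tool_calls(response: dict) -> list[dict]:
    """Return the requested tool calls with their arguments decoded.

    Returns an empty list when the model answered in plain text.
    """
    choice = response["choices"][0]
    if choice.get("finish_reason") != "tool_calls":
        return []
    calls = []
    for call in choice["message"].get("tool_calls") or []:
        function = call["function"]
        calls.append({
            "id": call["id"],
            "name": function["name"],
            # The arguments field is a JSON string and must be parsed.
            "arguments": json.loads(function["arguments"]),
        })
    return calls
```

A local model may occasionally emit arguments that are not valid JSON; callers would probably want to catch `json.JSONDecodeError` and treat that round as a failed tool call.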
### Sending Tool Results Back
Append the assistant's tool_call message and a tool result message, then re-submit:
```json
{
  "messages": [
    { "role": "user", "content": "What's the weather?" },
    { "role": "assistant", "content": null,
      "tool_calls": [{ "id": "call_abc123", "type": "function",
                       "function": { "name": "web_search", "arguments": "..." } }] },
    { "role": "tool", "tool_call_id": "call_abc123",
      "content": "Current weather in NYC: 62°F, partly cloudy." }
  ],
  "tools": [...],
  "tool_choice": "auto"
}
```
Repeat until `finish_reason: "stop"`.
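
The full round trip can be sketched as a small loop. This is an illustration under assumptions, not the planned orchestrator: `run_tool_loop`, the `send` callable (which would POST the payload to `/api/chat/completions` and return the decoded JSON), the `handlers` mapping, and the `max_rounds` cap are all hypothetical.

```python
import json
from typing import Callable


def run_tool_loop(send: Callable[[dict], dict], model: str, messages: list,
                  tools: list, handlers: dict, max_rounds: int = 8) -> str:
    """Call the model, execute requested tools, and repeat until it stops."""
    messages = list(messages)  # do not mutate the caller's history
    for _ in range(max_rounds):
        response = send({"model": model, "messages": messages,
                         "tools": tools, "tool_choice": "auto"})
        choice = response["choices"][0]
        message = choice["message"]
        if choice.get("finish_reason") != "tool_calls":
            return message.get("content") or ""
        # Echo the assistant's tool_call message, then append one result per call.
        messages.append(message)
        for call in message["tool_calls"]:
            name = call["function"]["name"]
            args = json.loads(call["function"]["arguments"])
            result = handlers[name](**args)
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": str(result)})
    raise RuntimeError("tool loop did not finish within max_rounds")
```

Passing `send` in as a parameter keeps the loop testable without a running Open WebUI instance, and the round cap guards against a model that never returns `finish_reason: "stop"`.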
---
## RAG (Retrieval Augmented Generation)
### Upload a File
```
POST /api/v1/files/
Authorization: Bearer sk-...
Content-Type: multipart/form-data
file=@/path/to/document.pdf
```
Returns a file ID. Poll `/api/v1/files/{id}/process/status` until `completed`.
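
A polling loop for this step might look like the sketch below. The shape of the status response is not documented here, so the sketch takes a caller-supplied `fetch_status` callable that returns the status string; that callable, the helper name, and the timeout and interval defaults are all assumptions.

```python
import time
from typing import Callable


def wait_until_processed(fetch_status: Callable[[], str],
                         timeout_s: float = 120.0, interval_s: float = 2.0,
                         sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll until the status is "completed"; return the number of polls made.

    Raises TimeoutError if processing does not complete in time.
    """
    deadline = time.monotonic() + timeout_s
    polls = 0
    while True:
        polls += 1
        if fetch_status() == "completed":
            return polls
        if time.monotonic() >= deadline:
            raise TimeoutError("file processing did not complete in time")
        sleep(interval_s)
```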
### Knowledge Collections
```
POST /api/v1/knowledge/{collection_id}/file/add
{ "file_id": "..." }
```
### Use in Chat
Reference files or knowledge collections in any chat request:
```json
{
  "model": "gemma4-26b-a4b",
  "messages": [...],
  "files": [
    { "type": "file", "id": "file-id" },
    { "type": "collection", "id": "collection-id" }
  ]
}
```
### Process a Web URL into a Collection
```
POST /api/v1/retrieval/process/web
{ "url": "https://example.com/article", "collection_id": "..." }
```
---
## Filter Behavior with Direct API Calls
Open WebUI supports inlet/outlet filter pipelines. With direct API access:
| Filter | Runs automatically? |
|-----------|---------------------|
| `inlet()` | ✅ Yes |
| `stream()`| ✅ Yes |
| `outlet()`| ❌ Manual only: call `POST /api/chat/completed` after receiving the response |

For Cortex's use case (tool loop orchestration), this is not a concern: we drive
the loop ourselves and don't rely on Open WebUI's filter pipeline.
---
## Relevant Cortex Files
| File | Purpose |
|---|---|
| `cortex/llm_client.py` → `_local()` | Current local backend (direct chat only) |
| `cortex/routers/local_llm.py` | Local model settings page + fetch-models endpoint |
| `cortex/user_settings.py` | Per-user host + model config (`local_llm.json`) |
| `cortex/orchestrator_engine.py` | Gemini API tool loop — reference for local version |
| `home/{user}/local_llm.json` | Stored host/model config |
---
## Planned: Local Orchestrator (`local_orchestrator_engine.py`)
A local equivalent of `orchestrator_engine.py` that:
1. Takes the same tool definitions already registered in `cortex/tools/`
2. Converts them to OpenAI `tools` format (already close — minor schema diff from Gemini)
3. Runs a ReAct loop against the local model via `/api/chat/completions`
4. Falls back gracefully if the model doesn't return a valid tool call
See `documentation/TODO__Agents.md` → `[Local] Tool-capable local orchestrator`.
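
Step 2 could be sketched roughly as follows. The sketch assumes the registered tools are Gemini-style function declarations (`name`, `description`, `parameters`) with upper-case type names such as `OBJECT` and `STRING`; the actual shape in `cortex/tools/` is not shown in this document, and `to_openai_tool` is a hypothetical name.

```python
def _lowercase_types(schema):
    """Recursively lower-case the string value of every "type" key."""
    if isinstance(schema, dict):
        return {
            key: value.lower() if key == "type" and isinstance(value, str)
            else _lowercase_types(value)
            for key, value in schema.items()
        }
    if isinstance(schema, list):
        return [_lowercase_types(item) for item in schema]
    return schema


def to_openai_tool(declaration: dict) -> dict:
    """Wrap a Gemini-style declaration in the OpenAI `tools` entry format."""
    return {
        "type": "function",
        "function": {
            "name": declaration["name"],
            "description": declaration.get("description", ""),
            "parameters": _lowercase_types(
                declaration.get("parameters", {"type": "object", "properties": {}})
            ),
        },
    }
```

Only string values under a `type` key are rewritten, so a tool parameter that happens to be named `type` keeps its own schema intact.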
Model recommendation:
- **Gemma 4 26B A4B** (256k ctx, MoE — fast for its size) for complex tool tasks
- **Gemma 4 E4B** (128k ctx) for lightweight/fast tasks
---
## Notes
- Open WebUI workspace aliases (e.g. `agent-support-gemma-small`) resolve to the
underlying Ollama model — use aliases in Cortex for human-friendly model names.
- `tool_choice: "auto"` lets the model decide; `"none"` forces plain text response;
`{"type": "function", "function": {"name": "..."}}` forces a specific tool.
- Gemma 4 models support combined tool use + reasoning (thinking tokens) — useful
for complex multi-step tasks.
- For embeddings (future RAG work), use `/ollama/api/embed` directly.