Local LLM:
- user_settings.py: per-user hosts/models config (local_llm.json)
- routers/local_llm.py + static/local_llm.html: dedicated settings page
- llm_client.py: local OpenAI-compatible backend via httpx
- config.py: LOCAL_API_URL/KEY/MODEL + per-backend timeouts
- Active model shown near backend toggle (amber hint text)
Memory distillation:
- memory_distiller.py: DISTILL_BACKEND_MID/LONG .env overrides
- scheduler.py + notification.py: notify NC Talk after mid/long distill
- notification.py: outbound channel abstraction (NC Talk, extensible)
Session search:
- routers/files.py: GET /sessions/search?q= with excerpts grouped by date
- static/index.html + app.js: search UI in file sidebar with highlight
- _esc() helper to prevent XSS in search results
Proactive cron:
- cron_runner.py: new job types — message (send directly) and brief (LLM + send)
- Both support optional per-job channel override
Channels:
- routers/nextcloud_talk.py: consolidated using notification._send_nct_message()
- routers/auth.py: local backend status in /auth/status
- routers/chat.py: /backend returns {primary, fallback, local_model} object
UI / UX:
- Copy button for user messages (matching assistant)
- Autocomplete disabled on sensitive form fields
- settings.html: local model section replaced with link to /settings/local
Docs overhaul:
- MASTER.md hub + ARCH__SYSTEM/BACKENDS/PERSONA/CHANNELS/FUTURE.md
- ARCH__Intelligence_Layer.md replaced with redirect table
- CORTEX.md trimmed to vision only; README updated
- OPEN_WEBUI_API.md added to docs/
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
277 lines
7.4 KiB
Markdown
# Open WebUI API Reference for Cortex

> Last updated: 2026-04-03
> Source: https://docs.openwebui.com/reference/api-endpoints/
> Host in use: `http://192.168.32.19:3000` (scott_gaming, 8 GB VRAM)

## Local Model Performance (scott_gaming, 8 GB VRAM)

| Model | Alias | Speed | Practical Context | Spec Context |
|---|---|---|---|---|
| Gemma 4 E4B | `agent-support-gemma-small` | ~25 t/s | **72k tokens** | 128k |
| Gemma 4 26B A4B (MoE) | `agent-support-gemma-medium` | ~9 t/s | **50k tokens** | 256k |

Context is VRAM-constrained: spec limits are higher, but the KV cache fills available VRAM first.
Techniques to extend usable context: lower-precision KV cache quantization, flash attention, and context length tuning in Ollama.

**Practical implications for the local orchestrator:**
- System prompt + memory (T2) + tool results + history: budget ~40-50k for small, ~35-40k for medium
- Medium at 9 t/s is fine for background/async tasks; small at 25 t/s is responsive enough for interactive use
- Both are well above what's needed for most tool loop iterations (~2-5k tokens per round)

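
The budget figures above can be turned into a rough headroom check. The sketch below is only illustrative: `remaining_rounds` and the 5k-per-round default are assumptions taken from the upper end of the per-round estimate, not part of Cortex.

```python
# Practical context windows measured on scott_gaming (from the table above).
PRACTICAL_CONTEXT = {
    "agent-support-gemma-small": 72_000,
    "agent-support-gemma-medium": 50_000,
}


def remaining_rounds(alias: str, fixed_tokens: int, tokens_per_round: int = 5_000) -> int:
    """Estimate how many tool-loop rounds still fit in the practical window.

    fixed_tokens is the system prompt + memory + history already in context.
    tokens_per_round is a pessimistic per-iteration cost (hypothetical default).
    """
    free = PRACTICAL_CONTEXT[alias] - fixed_tokens
    if free <= 0:
        return 0
    return free // tokens_per_round
```

With 45k tokens already committed on the small model, this leaves room for about five worst-case rounds.
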

---

## Authentication

All API calls use a bearer token:

```
Authorization: Bearer sk-<api-key>
```

API keys are managed in Open WebUI → Settings → Account → API Keys.
Cortex stores these per-user in `home/{username}/local_llm.json` → `hosts[].api_key`.

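
As a minimal sketch of how the stored key might become a request header: only the `hosts[].api_key` path comes from this document; the helper names and the sample config are hypothetical, and the real `local_llm.json` probably carries more fields.

```python
import json


def auth_headers(api_key: str) -> dict[str, str]:
    """Build the bearer-token header Open WebUI expects."""
    return {"Authorization": f"Bearer {api_key}"}


def headers_for_host(config_text: str, index: int = 0) -> dict[str, str]:
    """Read hosts[index].api_key from local_llm.json content (hypothetical helper)."""
    config = json.loads(config_text)
    return auth_headers(config["hosts"][index]["api_key"])


# Illustrative config only; not the real file layout beyond hosts[].api_key.
sample_config = '{"hosts": [{"api_key": "sk-example"}]}'
```
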

---

## Core Endpoints Used by Cortex

### List Available Models

```
GET /api/models
Authorization: Bearer sk-...
```

Returns all models (Ollama, OpenAI-proxied, custom functions).
Used by `/api/local-llm/fetch-models` in `routers/local_llm.py`.

Response shape:
```json
{
  "data": [
    { "id": "gemma4-e4b", "name": "Gemma 4 E4B" },
    ...
  ]
}
```

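
A settings page would typically reduce that response to `(id, name)` pairs. The function below is a sketch under that assumption; `model_choices` is a hypothetical name and is not taken from `routers/local_llm.py`.

```python
def model_choices(payload: dict) -> list[tuple[str, str]]:
    """Return (id, display name) pairs from a GET /api/models response."""
    choices = []
    for entry in payload.get("data", []):
        model_id = entry.get("id")
        if not model_id:
            continue  # skip malformed entries rather than failing the whole list
        # Fall back to the id when no human-readable name is present.
        choices.append((model_id, entry.get("name") or model_id))
    return choices
```
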
### Chat Completions (OpenAI-compatible)

```
POST /api/chat/completions
Authorization: Bearer sk-...
Content-Type: application/json
```

Standard OpenAI chat format. Supports:
- `messages`: standard role/content array
- `model`: model ID or workspace alias
- `tools` + `tool_choice`: function calling (see Tool / Function Calling below)
- `stream: true/false`

This is the endpoint used by `_local()` in `llm_client.py`.

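
For orientation, a non-streaming call could look roughly like the sketch below. It is not the actual `_local()` implementation: both function names are hypothetical, and the 120-second timeout is an arbitrary placeholder for the per-backend timeouts in `config.py`.

```python
def build_chat_request(base_url: str, api_key: str, model: str,
                       messages: list[dict], stream: bool = False):
    """Assemble URL, headers and JSON body for POST /api/chat/completions."""
    url = base_url.rstrip("/") + "/api/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {"model": model, "messages": messages, "stream": stream}
    return url, headers, body


def send_chat(base_url: str, api_key: str, model: str,
              messages: list[dict], timeout: float = 120.0) -> str:
    """Send one non-streaming request and return the assistant text."""
    import httpx  # imported lazily so build_chat_request works without httpx

    url, headers, body = build_chat_request(base_url, api_key, model, messages)
    response = httpx.post(url, headers=headers, json=body, timeout=timeout)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```

Keeping request construction separate from the network call makes the payload easy to unit-test without a running Open WebUI instance.
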
### Anthropic Messages API Compatibility

```
POST /api/v1/messages
Authorization: Bearer sk-...
```

Open WebUI also accepts Anthropic-format requests and auto-converts them.
This could be used to route Claude SDK calls through Open WebUI.
Base URL for this mode: `http://192.168.32.19:3000/api`

### Direct Ollama Proxy

```
GET  /ollama/api/tags       # list models
POST /ollama/api/generate   # streaming completions
POST /ollama/api/embed      # generate embeddings
```

Use these if you need to bypass Open WebUI's filter layer and hit Ollama directly.
Ollama is also accessible directly at `http://192.168.32.19:11434`.


---

## Tool / Function Calling

Both Gemma 4 models (E4B and 26B A4B) support function calling via the standard
OpenAI `tools` parameter. Open WebUI passes this through to the underlying model.

### Request Format

Request body for `POST /api/chat/completions`:

```json
{
  "model": "gemma4-26b-a4b",
  "messages": [
    { "role": "system", "content": "..." },
    { "role": "user", "content": "What's the weather?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "web_search",
        "description": "Search the web for current information",
        "parameters": {
          "type": "object",
          "properties": {
            "query": { "type": "string", "description": "Search query" }
          },
          "required": ["query"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
```

### Tool Call Response

When the model wants to call a tool, it returns `finish_reason: "tool_calls"`:

```json
{
  "choices": [{
    "finish_reason": "tool_calls",
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "web_search",
          "arguments": "{\"query\": \"current weather NYC\"}"
        }
      }]
    }
  }]
}
```

### Sending Tool Results Back

Append the assistant's `tool_calls` message and a matching `tool` result message, then re-submit:

```json
{
  "messages": [
    { "role": "user", "content": "What's the weather?" },
    { "role": "assistant", "content": null,
      "tool_calls": [{ "id": "call_abc123", "type": "function", "function": { "name": "web_search", "arguments": "..." } }] },
    { "role": "tool", "tool_call_id": "call_abc123",
      "content": "Current weather in NYC: 62°F, partly cloudy." }
  ],
  "tools": [...],
  "tool_choice": "auto"
}
```

Repeat until the model returns `finish_reason: "stop"`.

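
The request/response cycle above can be sketched as a small loop. This is an illustration, not Cortex code: `run_tool_loop`, the injected `send` callable (anything that posts a body to `/api/chat/completions` and returns the parsed JSON), the `handlers` mapping and the `max_rounds` cap are all assumptions.

```python
import json
from typing import Callable


def run_tool_loop(send: Callable[[dict], dict], model: str, messages: list[dict],
                  tools: list[dict], handlers: dict[str, Callable[..., object]],
                  max_rounds: int = 8) -> str:
    """Drive the tool loop until the model stops asking for tools."""
    history = list(messages)
    for _ in range(max_rounds):
        response = send({"model": model, "messages": list(history),
                         "tools": tools, "tool_choice": "auto"})
        choice = response["choices"][0]
        message = choice["message"]
        if choice.get("finish_reason") != "tool_calls":
            return message.get("content") or ""
        # Keep the assistant's tool_calls message so the ids can be matched.
        history.append(message)
        for call in message["tool_calls"]:
            args = json.loads(call["function"]["arguments"])
            result = handlers[call["function"]["name"]](**args)
            history.append({"role": "tool", "tool_call_id": call["id"],
                            "content": str(result)})
    raise RuntimeError("tool loop did not finish within max_rounds")
```

The round cap is a guard against a model that keeps requesting tools; a real orchestrator would probably also handle unknown tool names and malformed argument JSON.
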

---

## RAG (Retrieval Augmented Generation)

### Upload a File

```
POST /api/v1/files/
Authorization: Bearer sk-...
Content-Type: multipart/form-data

file=@/path/to/document.pdf
```

Returns a file ID. Poll `/api/v1/files/{id}/process/status` until the status is `completed`.

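
A polling helper for that status endpoint might look like the sketch below. The `fetch_status` callable is hypothetical (it should call the status endpoint and return the status string), and treating `failed` as a terminal state is an assumption; only `completed` is confirmed by this document.

```python
import time
from typing import Callable


def wait_until_processed(fetch_status: Callable[[str], str], file_id: str,
                         timeout: float = 300.0, interval: float = 2.0,
                         sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll until the file is processed; return the number of polls made."""
    deadline = time.monotonic() + timeout
    polls = 0
    while True:
        status = fetch_status(file_id)
        polls += 1
        if status == "completed":
            return polls
        if status == "failed":  # assumed terminal state, see lead-in
            raise RuntimeError(f"processing failed for file {file_id}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"file {file_id} not processed after {timeout}s")
        sleep(interval)
```
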
### Knowledge Collections

```
POST /api/v1/knowledge/{collection_id}/file/add
{ "file_id": "..." }
```

### Use in Chat

Reference files or knowledge collections in any chat request:

```json
{
  "model": "gemma4-26b-a4b",
  "messages": [...],
  "files": [
    { "type": "file", "id": "file-id" },
    { "type": "collection", "id": "collection-id" }
  ]
}
```

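
If Cortex ever attaches RAG sources programmatically, the `files` array could be built with a helper along these lines. `attach_sources` is a hypothetical name; the entry shapes are copied from the request above.

```python
def attach_sources(body: dict, file_ids=(), collection_ids=()) -> dict:
    """Return a copy of a chat request body with a `files` array attached."""
    files = [{"type": "file", "id": file_id} for file_id in file_ids]
    files += [{"type": "collection", "id": coll_id} for coll_id in collection_ids]
    enriched = dict(body)  # shallow copy so the caller's body is not mutated
    if files:
        enriched["files"] = files
    return enriched
```
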
### Process a Web URL into a Collection

```
POST /api/v1/retrieval/process/web
{ "url": "https://example.com/article", "collection_id": "..." }
```


---

## Filter Behavior with Direct API Calls

Open WebUI supports inlet/outlet filter pipelines. With direct API access:

| Filter | Runs automatically? |
|-----------|---------------------|
| `inlet()` | ✅ Yes |
| `stream()`| ✅ Yes |
| `outlet()`| ❌ Manual only: call `POST /api/chat/completed` after receiving the response |

For Cortex's use case (tool loop orchestration), this is not a concern: we're
driving the loop ourselves and don't rely on Open WebUI's filter pipeline.


---

## Relevant Cortex Files

| File | Purpose |
|---|---|
| `cortex/llm_client.py` (`_local()`) | Current local backend (direct chat only) |
| `cortex/routers/local_llm.py` | Local model settings page + fetch-models endpoint |
| `cortex/user_settings.py` | Per-user host + model config (`local_llm.json`) |
| `cortex/orchestrator_engine.py` | Gemini API tool loop, the reference for the local version |
| `home/{user}/local_llm.json` | Stored host/model config |


---

## Planned: Local Orchestrator (`local_orchestrator_engine.py`)

A local equivalent of `orchestrator_engine.py` that:
1. Takes the same tool definitions already registered in `cortex/tools/`
2. Converts them to OpenAI `tools` format (already close; only a minor schema difference from Gemini)
3. Runs a ReAct loop against the local model via `/api/chat/completions`
4. Falls back gracefully if the model doesn't return a valid tool call

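
Step 2 might look something like the sketch below. It assumes the registered definitions follow the Gemini function-declaration shape (`name`, `description`, `parameters` with upper-case type names such as `OBJECT` and `STRING`); the actual format in `cortex/tools/` may differ, and both function names are hypothetical.

```python
def _lower_types(schema):
    """Recursively lower-case `type` values (e.g. "STRING" -> "string")."""
    if isinstance(schema, dict):
        return {
            key: value.lower() if key == "type" and isinstance(value, str)
            else _lower_types(value)
            for key, value in schema.items()
        }
    if isinstance(schema, list):
        return [_lower_types(item) for item in schema]
    return schema


def to_openai_tool(declaration: dict) -> dict:
    """Wrap one Gemini-style declaration in the OpenAI `tools` envelope."""
    parameters = declaration.get("parameters") or {"type": "object", "properties": {}}
    return {
        "type": "function",
        "function": {
            "name": declaration["name"],
            "description": declaration.get("description", ""),
            "parameters": _lower_types(parameters),
        },
    }
```
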
See `documentation/TODO__Agents.md`, item `[Local] Tool-capable local orchestrator`.

Model recommendation:
- **Gemma 4 26B A4B** (256k spec ctx; MoE, fast for its size) for complex tool tasks
- **Gemma 4 E4B** (128k spec ctx) for lightweight/fast tasks


---

## Notes

- Open WebUI workspace aliases (e.g. `agent-support-gemma-small`) resolve to the
  underlying Ollama model; use aliases in Cortex for human-friendly model names.
- `tool_choice: "auto"` lets the model decide; `"none"` forces a plain text response;
  `{"type": "function", "function": {"name": "..."}}` forces a specific tool.
- Gemma 4 models support combined tool use + reasoning (thinking tokens), which is useful
  for complex multi-step tasks.
- For embeddings (future RAG work), use `/ollama/api/embed` directly.
