# Open WebUI API Reference for Cortex

> Last updated: 2026-04-03
> Source: https://docs.openwebui.com/reference/api-endpoints/
> Host in use: `http://192.168.32.19:3000` (scott_gaming, 8 GB VRAM)

## Local Model Performance (scott_gaming, 8 GB VRAM)

| Model | Alias | Speed | Practical Context | Spec Context |
|---|---|---|---|---|
| Gemma 4 E4B | `agent-support-gemma-small` | ~25 t/s | **72k tokens** | 128k |
| Gemma 4 26B A4B (MoE) | `agent-support-gemma-medium` | ~9 t/s | **50k tokens** | 256k |

Context is VRAM-constrained: spec limits are higher, but the KV cache fills available VRAM first. Techniques to fit more context: lower-precision KV cache quantization, flash attention, and context length tuning in Ollama.

**Practical implications for the local orchestrator:**

- System prompt + memory (T2) + tool results + history: budget ~40-50k tokens for small, ~35-40k tokens for medium
- Medium at 9 t/s is fine for background/async tasks; small at 25 t/s is responsive enough for interactive use
- Both budgets are well above what's needed for most tool loop iterations (~2-5k tokens per round)

---

## Authentication

All API calls use a bearer token:

```
Authorization: Bearer sk-...
```

API keys are managed in Open WebUI → Settings → Account → API Keys. Cortex stores these per user in `home/{username}/local_llm.json` under `hosts[].api_key`.

---

## Core Endpoints Used by Cortex

### List Available Models

```
GET /api/models
Authorization: Bearer sk-...
```

Returns all models (Ollama, OpenAI-proxied, custom functions). Used by `/api/local-llm/fetch-models` in `routers/local_llm.py`.

Response shape:

```json
{
  "data": [
    { "id": "gemma4-e4b", "name": "Gemma 4 E4B" },
    ...
  ]
}
```

### Chat Completions (OpenAI-compatible)

```
POST /api/chat/completions
Authorization: Bearer sk-...
Content-Type: application/json
```

Standard OpenAI chat format. Supports:

- `messages`: standard role/content array
- `model`: model ID or workspace alias
- `tools` + `tool_choice`: function calling (see Tool / Function Calling below)
- `stream: true/false`

This is the endpoint used by `_local()` in `llm_client.py`.

### Anthropic Messages API Compatibility

```
POST /api/v1/messages
Authorization: Bearer sk-...
```

Open WebUI also accepts Anthropic-format requests and auto-converts them. This could be used to route Claude SDK calls through Open WebUI. Base URL for this mode: `http://192.168.32.19:3000/api`

### Direct Ollama Proxy

```
GET  /ollama/api/tags      # list models
POST /ollama/api/generate  # streaming completions
POST /ollama/api/embed     # generate embeddings
```

Use these if you need to bypass Open WebUI's filter layer and hit Ollama directly. Ollama is also accessible directly at `http://192.168.32.19:11434`.

---

## Tool / Function Calling

Both Gemma 4 models (E4B and 26B A4B) support function calling via the standard OpenAI `tools` parameter. Open WebUI passes this through to the underlying model.

### Request Format

```json
POST /api/chat/completions
{
  "model": "gemma4-26b-a4b",
  "messages": [
    { "role": "system", "content": "..." },
    { "role": "user", "content": "What's the weather?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "web_search",
        "description": "Search the web for current information",
        "parameters": {
          "type": "object",
          "properties": {
            "query": { "type": "string", "description": "Search query" }
          },
          "required": ["query"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
```

### Tool Call Response

When the model wants to call a tool, it returns `finish_reason: "tool_calls"`:

```json
{
  "choices": [{
    "finish_reason": "tool_calls",
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "web_search",
          "arguments": "{\"query\": \"current weather NYC\"}"
        }
      }]
    }
  }]
}
```

Note that `arguments` is a JSON-encoded string, not an object, so it must be parsed before the tool is invoked.

### Sending Tool Results Back

Append the assistant's tool-call message and a tool result message whose `tool_call_id` matches the call's `id`, then re-submit:

```json
{
  "messages": [
    { "role": "user", "content": "What's the weather?" },
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": { "name": "web_search", "arguments": "..." }
      }]
    },
    {
      "role": "tool",
      "tool_call_id": "call_abc123",
      "content": "Current weather in NYC: 62°F, partly cloudy."
    }
  ],
  "tools": [...],
  "tool_choice": "auto"
}
```

Repeat until `finish_reason: "stop"`.

---

## RAG (Retrieval Augmented Generation)

### Upload a File

```
POST /api/v1/files/
Authorization: Bearer sk-...
Content-Type: multipart/form-data

file=@/path/to/document.pdf
```

Returns a file ID. Poll `/api/v1/files/{id}/process/status` until `completed`.

### Knowledge Collections

```
POST /api/v1/knowledge/{collection_id}/file/add
{ "file_id": "..." }
```

### Use in Chat

Reference files or knowledge collections in any chat request:

```json
{
  "model": "gemma4-26b-a4b",
  "messages": [...],
  "files": [
    { "type": "file", "id": "file-id" },
    { "type": "collection", "id": "collection-id" }
  ]
}
```

### Process a Web URL into a Collection

```
POST /api/v1/retrieval/process/web
{ "url": "https://example.com/article", "collection_id": "..." }
```

---

## Filter Behavior with Direct API Calls

Open WebUI supports inlet/outlet filter pipelines. With direct API access:

| Filter | Runs automatically? |
|---|---|
| `inlet()` | ✅ Yes |
| `stream()` | ✅ Yes |
| `outlet()` | ❌ Manual only: call `POST /api/chat/completed` after receiving the response |

For Cortex's use case (tool loop orchestration), this is not a concern: we drive the loop ourselves and don't rely on Open WebUI's filter pipeline.

---

## Relevant Cortex Files

| File | Purpose |
|---|---|
| `cortex/llm_client.py` (`_local()`) | Current local backend (direct chat only) |
| `cortex/routers/local_llm.py` | Local model settings page + fetch-models endpoint |
| `cortex/user_settings.py` | Per-user host + model config (`local_llm.json`) |
| `cortex/orchestrator_engine.py` | Gemini API tool loop; reference for the local version |
| `home/{username}/local_llm.json` | Stored host/model config |

---

## Planned: Local Orchestrator (`local_orchestrator_engine.py`)

A local equivalent of `orchestrator_engine.py` that:

1. Takes the same tool definitions already registered in `cortex/tools/`
2. Converts them to OpenAI `tools` format (already close; minor schema differences from Gemini)
3. Runs a ReAct loop against the local model via `/api/chat/completions`
4. Falls back gracefully if the model doesn't return a valid tool call

See `documentation/TODO__Agents.md`, item `[Local] Tool-capable local orchestrator`.

Model recommendation:

- **Gemma 4 26B A4B** (256k spec ctx; MoE, fast for its size) for complex tool tasks
- **Gemma 4 E4B** (128k spec ctx) for lightweight/fast tasks

---

## Notes

- Open WebUI workspace aliases (e.g. `agent-support-gemma-small`) resolve to the underlying Ollama model; use aliases in Cortex for human-friendly model names.
- `tool_choice: "auto"` lets the model decide; `"none"` forces a plain text response; `{"type": "function", "function": {"name": "..."}}` forces a specific tool.
- Gemma 4 models support combined tool use + reasoning (thinking tokens), which is useful for complex multi-step tasks.
- For embeddings (future RAG work), use `/ollama/api/embed` directly.
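
The tool loop described under Tool / Function Calling (append the assistant's tool-call message, add a `role: "tool"` result, resubmit until `finish_reason` is `"stop"`) could be sketched in Python roughly as follows. This is a minimal sketch, not the planned `local_orchestrator_engine.py`: the names `post_chat`, `run_tool_loop`, `handlers`, `send`, and `max_rounds` are hypothetical, the response shape is assumed to match the examples in this document, and error handling is deliberately simple (a failing or unknown tool is reported back to the model as text).

```python
import json
import urllib.request

BASE_URL = "http://192.168.32.19:3000"  # host listed at the top of this document


def post_chat(payload, api_key, base_url=BASE_URL):
    """POST a payload to /api/chat/completions and return the parsed JSON body."""
    req = urllib.request.Request(
        f"{base_url}/api/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


def run_tool_loop(model, messages, tools, handlers, send, max_rounds=8):
    """Drive the tool loop until the model stops requesting tools.

    handlers: dict mapping tool name -> callable accepting keyword arguments
              (hypothetical; Cortex's real tool registry may look different).
    send:     callable taking a payload dict and returning the response dict,
              so the HTTP layer can be swapped out or faked.
    Returns (final_text, full_message_history).
    """
    messages = list(messages)  # work on a copy, not the caller's list
    for _ in range(max_rounds):
        response = send({
            "model": model,
            "messages": messages,
            "tools": tools,
            "tool_choice": "auto",
            "stream": False,
        })
        choice = response["choices"][0]
        message = choice["message"]
        tool_calls = message.get("tool_calls") or []

        # No tool request: treat the reply as the final answer.
        if choice.get("finish_reason") != "tool_calls" or not tool_calls:
            return message.get("content") or "", messages

        # Echo the assistant's tool-call message, then answer each call.
        messages.append(message)
        for call in tool_calls:
            name = call["function"]["name"]
            try:
                # "arguments" arrives as a JSON-encoded string.
                args = json.loads(call["function"]["arguments"] or "{}")
                result = handlers[name](**args)
            except Exception as exc:  # unknown tool, bad JSON, tool failure
                result = f"tool error: {exc}"
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": str(result),
            })
    raise RuntimeError("tool loop did not finish within max_rounds")
```

To run this against the real host, `send` could be `functools.partial(post_chat, api_key=...)` with a key from `local_llm.json`. Passing `send` in as a parameter keeps the loop testable without a running Open WebUI instance, and `max_rounds` guards against a model that never stops requesting tools.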