Cortex-Inara/docs/OPEN_WEBUI_API.md

# Open WebUI API Reference for Cortex

> Last updated: 2026-04-03
> Source: https://docs.openwebui.com/reference/api-endpoints/
> Host in use: `http://192.168.32.19:3000` (scott_gaming — 8 GB VRAM)

## Local Model Performance (scott_gaming, 8 GB VRAM)

| Model | Alias | Speed | Practical Context | Spec Context |
|---|---|---|---|---|
| Gemma 4 E4B | `agent-support-gemma-small` | ~25 t/s | **72k tokens** | 128k |
| Gemma 4 26B A4B (MoE) | `agent-support-gemma-medium` | ~9 t/s | **50k tokens** | 256k |

Context is VRAM-constrained: the spec limits are higher, but the KV cache fills available VRAM first.
Ways to extend it: quantize the KV cache to a lower precision, enable flash attention, and tune the context length in Ollama.

**Practical implications for the local orchestrator:**
- System prompt + memory (T2) + tool results + history: budget ~40-50k for small, ~35-40k for medium
- Medium at 9 t/s is fine for background/async tasks; small at 25 t/s is responsive enough for interactive use
- Both budgets are well above what most tool loop iterations need (~2-5k tokens per round)
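
The arithmetic behind these budgets can be sketched as below. This is a rough estimate only: the context figures come from the table above, while `rounds_left` and its worst-case default of 5k tokens per round are illustrative assumptions, not Cortex code.

```python
# Practical context limits from the table above, keyed by workspace alias.
PRACTICAL_CONTEXT = {
    "agent-support-gemma-small": 72_000,
    "agent-support-gemma-medium": 50_000,
}


def rounds_left(alias, used_tokens, tokens_per_round=5_000):
    """Estimate how many more tool rounds fit before the practical limit.

    `tokens_per_round` defaults to the upper end of the ~2-5k range (assumption).
    """
    headroom = PRACTICAL_CONTEXT[alias] - used_tokens
    return max(headroom // tokens_per_round, 0)


# With the upper budget already spent on prompt, memory, and history:
print(rounds_left("agent-support-gemma-small", 50_000))   # 4
print(rounds_left("agent-support-gemma-medium", 40_000))  # 2
```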

---

## Authentication

All API calls use a bearer token:

```
Authorization: Bearer sk-<api-key>
```

API keys are managed in Open WebUI → Settings → Account → API Keys.
Cortex stores these per-user in `home/{username}/local_llm.json` → `hosts[].api_key`.
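
A minimal sketch of building the header from that file. Only the `hosts[].api_key` path is taken from this document; the function name `load_auth_headers` and the `host_index` parameter are hypothetical, and the real file may contain other fields.

```python
import json
from pathlib import Path


def load_auth_headers(config_path, host_index=0):
    """Read one host's API key from local_llm.json and build request headers."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    # Only `hosts[].api_key` is documented; other keys are not assumed here.
    api_key = config["hosts"][host_index]["api_key"]
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```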

---

## Core Endpoints Used by Cortex

### List Available Models

```
GET /api/models
Authorization: Bearer sk-...
```

Returns all models (Ollama, OpenAI-proxied, custom functions).
Used by `/api/local-llm/fetch-models` in `routers/local_llm.py`.

Response shape:
```json
{
  "data": [
    { "id": "gemma4-e4b", "name": "Gemma 4 E4B" },
    ...
  ]
}
```
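
As an illustration, the response shape above could be fetched and reduced to a list of model IDs as follows. This is a sketch using only the standard library; `fetch_models` and `model_ids` are hypothetical names, not the code in `routers/local_llm.py`.

```python
import json
import urllib.request


def fetch_models(base_url, api_key, timeout=10):
    """GET /api/models and return the decoded JSON body."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def model_ids(models_response):
    """Extract the `id` of every entry under `data`, skipping malformed entries."""
    return [entry["id"] for entry in models_response.get("data", []) if "id" in entry]
```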

### Chat Completions (OpenAI-compatible)

```
POST /api/chat/completions
Authorization: Bearer sk-...
Content-Type: application/json
```

Standard OpenAI chat format. Supports:
- `messages` — standard role/content array
- `model` — model ID or workspace alias
- `tools` + `tool_choice` — function calling (see Tool Loop below)
- `stream: true/false`

This is the endpoint used by `_local()` in `llm_client.py`.
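
For reference, a non-streaming request to this endpoint might be assembled like this. It is a standard-library sketch under the assumptions stated in this section, not the actual body of `_local()`; `build_chat_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, messages, stream=False):
    """Build a POST request for /api/chat/completions (does not send it)."""
    payload = {"model": model, "messages": messages, "stream": stream}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it; with `stream=False` the body is a single JSON document.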

### Anthropic Messages API Compatibility

```
POST /api/v1/messages
Authorization: Bearer sk-...
```

Open WebUI also accepts Anthropic-format requests and converts them automatically.
This could be used to route Claude SDK calls through Open WebUI.
Base URL for this mode: `http://192.168.32.19:3000/api`

### Direct Ollama Proxy

```
GET  /ollama/api/tags        — list models
POST /ollama/api/generate    — streaming completions
POST /ollama/api/embed       — generate embeddings
```

Use these if you need to bypass Open WebUI's filter layer and hit Ollama directly.
Ollama is also accessible directly at `http://192.168.32.19:11434`.
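
A sketch of an embedding call through the proxy. The `model`/`input` request fields and the `embeddings` response field follow Ollama's `/api/embed` API; the helper names are hypothetical, and whichever embedding model is passed in must already be pulled on the host.

```python
import json
import urllib.request


def build_embed_request(base_url, api_key, model, texts):
    """Build a POST request for /ollama/api/embed (does not send it)."""
    payload = {"model": model, "input": texts}  # `input` may be a string or a list
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/ollama/api/embed",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_embeddings(embed_response):
    """Return the list of vectors, one per input text."""
    return embed_response["embeddings"]
```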

---

## Tool / Function Calling

Both Gemma 4 models (E4B and 26B A4B) support function calling via the standard
OpenAI `tools` parameter. Open WebUI passes this through to the underlying model.

### Request Format

```json
POST /api/chat/completions
{
  "model": "gemma4-26b-a4b",
  "messages": [
    { "role": "system", "content": "..." },
    { "role": "user",   "content": "What's the weather?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "web_search",
        "description": "Search the web for current information",
        "parameters": {
          "type": "object",
          "properties": {
            "query": { "type": "string", "description": "Search query" }
          },
          "required": ["query"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
```

### Tool Call Response

When the model wants to call a tool, it returns `finish_reason: "tool_calls"`:

```json
{
  "choices": [{
    "finish_reason": "tool_calls",
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "web_search",
          "arguments": "{\"query\": \"current weather NYC\"}"
        }
      }]
    }
  }]
}
```

### Sending Tool Results Back

Append the assistant's tool_call message and a tool result message, then re-submit:

```json
{
  "messages": [
    { "role": "user",      "content": "What's the weather?" },
    { "role": "assistant", "content": null,
      "tool_calls": [{ "id": "call_abc123", "type": "function", "function": { "name": "web_search", "arguments": "..." } }] },
    { "role": "tool",      "tool_call_id": "call_abc123",
      "content": "Current weather in NYC: 62°F, partly cloudy." }
  ],
  "tools": [...],
  "tool_choice": "auto"
}
```

Repeat until `finish_reason: "stop"`.
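
The request, tool call, and resubmit cycle described in this section could be driven by a loop like the one below. This is a minimal sketch, not the planned orchestrator: `send` (a callable that posts a payload to `/api/chat/completions` and returns the decoded JSON), `handlers` (a name-to-function map), and `max_rounds` are all hypothetical.

```python
import json


def run_tool_loop(send, model, messages, tools, handlers, max_rounds=8):
    """Call the model repeatedly, executing requested tools, until it stops."""
    messages = list(messages)  # avoid mutating the caller's history
    for _ in range(max_rounds):
        response = send({
            "model": model,
            "messages": messages,
            "tools": tools,
            "tool_choice": "auto",
        })
        choice = response["choices"][0]
        message = choice["message"]
        if choice.get("finish_reason") != "tool_calls":
            return message.get("content"), messages
        # Echo the assistant's tool_call message back, then one result per call.
        messages.append(message)
        for call in message.get("tool_calls", []):
            name = call["function"]["name"]
            arguments = json.loads(call["function"]["arguments"] or "{}")
            result = handlers[name](**arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": str(result),
            })
    raise RuntimeError(f"no final answer after {max_rounds} tool rounds")
```

Capping the number of rounds keeps a model that never emits `finish_reason: "stop"` from looping indefinitely.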

---

## RAG (Retrieval Augmented Generation)

### Upload a File

```
POST /api/v1/files/
Authorization: Bearer sk-...
Content-Type: multipart/form-data

file=@/path/to/document.pdf
```

The response includes a file ID. Poll `/api/v1/files/{id}/process/status` until the status is `completed`.
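
The polling step could be sketched as below. The shape of the status response is not documented here, so the sketch takes a hypothetical `get_status` callable that performs the HTTP call and returns the status string; treating `"failed"` as a terminal state is also an assumption.

```python
import time


def wait_for_processing(get_status, file_id, timeout_s=120.0, interval_s=2.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until the file reports `completed`, or raise on failure/timeout."""
    deadline = clock() + timeout_s
    while True:
        status = get_status(file_id)
        if status == "completed":
            return
        if status == "failed":  # assumed terminal state, not confirmed by this doc
            raise RuntimeError(f"processing failed for file {file_id}")
        if clock() >= deadline:
            raise TimeoutError(f"file {file_id} not processed within {timeout_s}s")
        sleep(interval_s)
```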

### Knowledge Collections

```
POST /api/v1/knowledge/{collection_id}/file/add
{ "file_id": "..." }
```

### Use in Chat

Reference files or knowledge collections in any chat request:

```json
{
  "model": "gemma4-26b-a4b",
  "messages": [...],
  "files": [
    { "type": "file",       "id": "file-id" },
    { "type": "collection", "id": "collection-id" }
  ]
}
```
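
A small helper along these lines could attach those references to an existing chat payload. It is a sketch only; `with_rag_sources` is a hypothetical name, and the entry shapes mirror the JSON above.

```python
def with_rag_sources(payload, file_ids=(), collection_ids=()):
    """Return a copy of `payload` with a `files` list for the given sources."""
    files = [{"type": "file", "id": file_id} for file_id in file_ids]
    files += [{"type": "collection", "id": coll_id} for coll_id in collection_ids]
    result = dict(payload)
    if files:
        result["files"] = files
    return result
```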

### Process a Web URL into a Collection

```
POST /api/v1/retrieval/process/web
{ "url": "https://example.com/article", "collection_id": "..." }
```

---

## Filter Behavior with Direct API Calls

Open WebUI supports inlet/outlet filter pipelines. With direct API access:

| Filter    | Runs automatically? |
|-----------|---------------------|
| `inlet()` | ✅ Yes              |
| `stream()`| ✅ Yes              |
| `outlet()`| ❌ Manual only — call `POST /api/chat/completed` after receiving response |

For Cortex's use case (tool loop orchestration), this is not a concern — we're
driving the loop ourselves and don't rely on Open WebUI's filter pipeline.

---

## Relevant Cortex Files

| File | Purpose |
|---|---|
| `cortex/llm_client.py` — `_local()` | Current local backend (direct chat only) |
| `cortex/routers/local_llm.py` | Local model settings page + fetch-models endpoint |
| `cortex/user_settings.py` | Per-user host + model config (`local_llm.json`) |
| `cortex/orchestrator_engine.py` | Gemini API tool loop — reference for local version |
| `home/{user}/local_llm.json` | Stored host/model config |

---

## Planned: Local Orchestrator (`local_orchestrator_engine.py`)

A local equivalent of `orchestrator_engine.py` that:
1. Takes the same tool definitions already registered in `cortex/tools/`
2. Converts them to OpenAI `tools` format (already close — minor schema diff from Gemini)
3. Runs a ReAct loop against the local model via `/api/chat/completions`
4. Falls back gracefully if the model doesn't return a valid tool call

See `documentation/TODO__Agents.md` — `[Local] Tool-capable local orchestrator`.
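
Step 2 might look roughly like the sketch below. It assumes the existing definitions are Gemini-style function declarations (`name`, `description`, `parameters`) whose schema types may be upper-case (`"OBJECT"`, `"STRING"`); how `cortex/tools/` actually stores them is not described here, and both function names are hypothetical.

```python
def lowercase_types(schema):
    """Recursively lower-case string `type` values, as JSON Schema expects."""
    if isinstance(schema, dict):
        return {
            key: value.lower() if key == "type" and isinstance(value, str)
            else lowercase_types(value)
            for key, value in schema.items()
        }
    if isinstance(schema, list):
        return [lowercase_types(item) for item in schema]
    return schema


def to_openai_tool(declaration):
    """Wrap one Gemini-style declaration in the OpenAI `tools` envelope."""
    empty_schema = {"type": "object", "properties": {}}
    return {
        "type": "function",
        "function": {
            "name": declaration["name"],
            "description": declaration.get("description", ""),
            "parameters": lowercase_types(declaration.get("parameters", empty_schema)),
        },
    }
```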

Model recommendation:
- **Gemma 4 26B A4B** (256k ctx, MoE — fast for its size) for complex tool tasks
- **Gemma 4 E4B** (128k ctx) for lightweight/fast tasks

---

## Notes

- Open WebUI workspace aliases (e.g. `agent-support-gemma-small`) resolve to the
  underlying Ollama model — use aliases in Cortex for human-friendly model names.
- `tool_choice: "auto"` lets the model decide; `"none"` forces plain text response;
  `{"type": "function", "function": {"name": "..."}}` forces a specific tool.
- Gemma 4 models support combined tool use + reasoning (thinking tokens) — useful
  for complex multi-step tasks.
- For embeddings (future RAG work), use `/ollama/api/embed` directly.