Docker Model Runner

Run AI models locally with Docker — no API keys, no costs, full data privacy.

Overview

Docker Model Runner (DMR) lets you run open-source AI models directly on your machine. Models run in Docker, so there’s no API key needed and no data leaves your computer.

💡 No API key needed

DMR runs models locally — your data never leaves your machine. Great for development, sensitive data, or offline use.

Prerequisites

Docker Model Runner must be installed and running on your machine, for example via Docker Desktop with the Model Runner feature enabled.

Configuration

Inline

agents:
  root:
    model: dmr/ai/qwen3

Named Model

models:
  local:
    provider: dmr
    model: ai/qwen3
    max_tokens: 8192

Available Models

Any model available through Docker Model Runner can be used. Common options:

Model          Description
ai/qwen3       Qwen 3 — versatile, good for coding and general tasks
ai/llama3.2    Llama 3.2 — Meta’s open-source model

Runtime Flags

Pass flags to the underlying inference runtime (e.g., llama.cpp) using provider_opts.runtime_flags:

models:
  local:
    provider: dmr
    model: ai/qwen3
    max_tokens: 8192
    provider_opts:
      runtime_flags: ["--threads", "8"]

Runtime flags also accept a single string:

provider_opts:
  runtime_flags: "--threads 8"

Use only flags your Model Runner backend allows (see docker model configure --help and the backend docs). Do not put sampling parameters (temperature, top_p, penalties) in runtime_flags; set them on the model instead, where they are sent per request via the OpenAI-compatible chat API.
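
For example, a configuration that keeps sampling on the model and passes only engine flags through runtime_flags might look like this (the specific values are illustrative):

models:
  local:
    provider: dmr
    model: ai/qwen3
    temperature: 0.2                      # sampling, sent per request via the chat API
    top_p: 0.9                            # sampling, sent per request via the chat API
    provider_opts:
      runtime_flags: ["--threads", "8"]   # engine flag, passed to the inference runtime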

Context size

max_tokens controls the maximum output tokens per chat completion request. To set the engine’s total context window, use provider_opts.context_size:

models:
  local:
    provider: dmr
    model: ai/qwen3
    max_tokens: 4096            # max output tokens (per-request)
    provider_opts:
      context_size: 32768       # total context window (sent via _configure)

If context_size is omitted, Model Runner uses its default. max_tokens is not used as the context window.

Thinking / reasoning budget

When using the llama.cpp backend, thinking_budget is sent as structured llamacpp.reasoning-budget on _configure (maps to --reasoning-budget). String efforts use the same token mapping as other providers; adaptive maps to unlimited (-1).

When using the vLLM backend, thinking_budget is sent as thinking_token_budget in each chat completion request. Effort levels map to token counts using the same scale as other providers; adaptive maps to unlimited (-1).

models:
  local:
    provider: dmr
    model: ai/qwen3
    thinking_budget: medium   # llama.cpp: reasoning-budget=8192; vLLM: thinking_token_budget=8192

On MLX and SGLang backends, thinking_budget is silently ignored — those engines do not currently expose a per-request reasoning token budget knob.

vLLM-specific configuration

When running a model on the vLLM backend, additional engine-level settings can be passed via provider_opts and are forwarded to model-runner’s _configure endpoint:

models:
  vllm-local:
    provider: dmr
    model: ai/some-model-safetensors
    provider_opts:
      gpu_memory_utilization: 0.9
      hf_overrides:
        max_model_len: 8192
        dtype: bfloat16

hf_overrides keys (including nested ones) must match ^[a-zA-Z_][a-zA-Z0-9_]*$ — the same rule model-runner enforces server-side to block injection via flags. Invalid keys are rejected at client creation time so you fail fast instead of after a round-trip.
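
For instance, a key written with dashes fails the pattern check and is rejected before any request is sent (hypothetical key, shown only to illustrate the rule):

provider_opts:
  hf_overrides:
    max-model-len: 8192   # invalid: '-' does not match ^[a-zA-Z_][a-zA-Z0-9_]*$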

These options are ignored on non-vLLM backends.

Keeping models resident in memory (keep_alive)

By default model-runner unloads idle models after a few minutes. Override the idle timeout via provider_opts.keep_alive:

models:
  sticky:
    provider: dmr
    model: ai/qwen3
    provider_opts:
      keep_alive: "30m"   # duration string
      # keep_alive: "0"   # unload immediately after each request
      # keep_alive: "-1"  # keep loaded forever

Accepted values: any Go duration string ("30s", "5m", "1h", "2h30m"), "0" (immediate unload), or "-1" (never unload). Invalid values are rejected before the configure request is sent.

Unloading models on agent switch

In multi-agent setups where two DMR models can’t fit in GPU memory simultaneously, wire the unload built-in hook into each agent’s on_agent_switch chain. Every time the active agent transfers control, the runtime POSTs to the engine’s _unload endpoint to free the previous model’s resources before the next one is loaded:

agents:
  coder:
    model: qwen3-large
    handoffs: [reviewer]
    hooks:
      on_agent_switch:
        - type: builtin
          command: unload
  reviewer:
    model: qwen3-coder
    handoffs: [coder]
    hooks:
      on_agent_switch:
        - type: builtin
          command: unload

The unload URL is derived from base_url by replacing the trailing /v1 segment (e.g. http://127.0.0.1:12434/engines/llama.cpp/v1 → http://127.0.0.1:12434/engines/llama.cpp/_unload). Override it explicitly via the provider-level unload_api field when running against a non-standard model-runner deployment:

providers:
  my_dmr:
    provider: dmr
    base_url: http://model-runner.docker.internal/engines/v1
    unload_api: /engines/_unload   # default; absolute URLs also work

models:
  big:
    provider: my_dmr
    model: ai/qwen3

Unload errors are logged and swallowed — a stuck or unreachable engine never blocks an agent transfer (each call is bounded to 10 s). Pair this with keep_alive only when you want the model to also survive idle periods within a single agent’s run; the hook controls between-agent unloads independently.
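
As a sketch of that pairing (model and agent names are illustrative), a model can stay resident through idle periods while still being evicted whenever control moves to another agent:

models:
  qwen3-large:
    provider: dmr
    model: ai/qwen3
    provider_opts:
      keep_alive: "-1"        # never unload on idle during a single agent's run

agents:
  coder:
    model: qwen3-large
    hooks:
      on_agent_switch:
        - type: builtin
          command: unload     # still frees the model when this agent hands off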

⚠️ Single-tenant assumption

The _unload endpoint is engine-level: it evicts the model from DMR’s memory regardless of who is using it. If two concurrent sessions on the same runtime (e.g. an API server serving multiple users) hit the same agent, switching away in session A will yank the model out from under session B’s in-flight request, which then has to wait for a reload. Wire unload only when the agents using these models are not run concurrently — typically a single TUI/CLI session.

See examples/unload_on_switch.yaml for the full example.

Operating mode (mode)

Model-runner normally infers the backend mode from the request path. You can pin it explicitly via provider_opts.mode:

provider_opts:
  mode: embedding   # one of: completion, embedding, reranking, image-generation

Most agents don’t need this — leave it unset unless you know you need it.

Raw runtime flags (raw_runtime_flags)

runtime_flags (a list) is the preferred way to pass flags. If you have a pre-built command-line string you’d rather ship verbatim, use raw_runtime_flags instead:

provider_opts:
  raw_runtime_flags: "--threads 8 --batch-size 512"

Model-runner parses the string with shell-style word splitting. runtime_flags and raw_runtime_flags are mutually exclusive — setting both is an error.
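
For example, a provider_opts block that sets both is an error and is rejected:

provider_opts:
  runtime_flags: ["--threads", "8"]
  raw_runtime_flags: "--batch-size 512"   # error: cannot be combined with runtime_flags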

Speculative Decoding

Use a smaller draft model to predict tokens ahead for faster inference:

models:
  fast-local:
    provider: dmr
    model: ai/qwen3:14B
    max_tokens: 8192
    provider_opts:
      speculative_draft_model: ai/qwen3:0.6B-F16
      speculative_num_tokens: 16
      speculative_acceptance_rate: 0.8

Custom Endpoint

If base_url is omitted, docker-agent auto-discovers the DMR endpoint. To set it manually:

models:
  local:
    provider: dmr
    model: ai/qwen3
    base_url: http://127.0.0.1:12434/engines/llama.cpp/v1

Troubleshooting