Docker Model Runner

Run AI models locally with Docker — no API keys, no costs, full data privacy.

Overview

Docker Model Runner (DMR) lets you run open-source AI models directly on your machine. Models run in Docker, so there’s no API key needed and no data leaves your computer.

💡 No API key needed

DMR runs models locally — your data never leaves your machine. Great for development, sensitive data, or offline use.

Prerequisites

Configuration

Inline

agents:
  root:
    model: dmr/ai/qwen3

Named Model

models:
  local:
    provider: dmr
    model: ai/qwen3
    max_tokens: 8192
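
An agent can then refer to the named model by its key. This is a sketch based on the inline example above; the exact agent syntax may differ in your framework version:

agents:
  root:
    model: local   # refers to the "local" entry under models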

Available Models

Any model available through Docker Model Runner can be used. Common options:

Model         Description
ai/qwen3      Qwen 3 — versatile, good for coding and general tasks
ai/llama3.2   Llama 3.2 — Meta’s open-source model

Runtime Flags

Pass flags to the underlying inference runtime (e.g., llama.cpp) using provider_opts.runtime_flags:

models:
  local:
    provider: dmr
    model: ai/qwen3
    max_tokens: 8192
    provider_opts:
      runtime_flags: ["--threads", "8"]

Runtime flags also accept a single string:

provider_opts:
  runtime_flags: "--threads 8"

Use only flags your Model Runner backend supports (see docker model configure --help and the backend docs). Do not put sampling parameters such as temperature, top_p, or penalties in runtime_flags; set them as model-level options instead, where they are sent per request via the OpenAI-compatible chat API.
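
For example, a sketch of the split: engine-level flags go in runtime_flags, while sampling parameters are plain model options (the specific temperature and top_p values here are illustrative):

models:
  local:
    provider: dmr
    model: ai/qwen3
    temperature: 0.7      # sampling: set on the model, sent per request
    top_p: 0.9            # sampling: set on the model, sent per request
    provider_opts:
      runtime_flags: ["--threads", "8"]   # engine flag: belongs here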

Context size

max_tokens controls the maximum output tokens per chat completion request. To set the engine’s total context window, use provider_opts.context_size:

models:
  local:
    provider: dmr
    model: ai/qwen3
    max_tokens: 4096            # max output tokens (per-request)
    provider_opts:
      context_size: 32768       # total context window (sent via _configure)

If context_size is omitted, Model Runner uses its default. max_tokens is not used as the context window.

Thinking / reasoning budget

When using the llama.cpp backend, thinking_budget is sent as structured llamacpp.reasoning-budget on _configure (maps to --reasoning-budget). String efforts use the same token mapping as other providers; adaptive maps to unlimited (-1).

When using the vLLM backend, thinking_budget is sent as thinking_token_budget in each chat completion request. Effort levels map to token counts using the same scale as other providers; adaptive maps to unlimited (-1).

models:
  local:
    provider: dmr
    model: ai/qwen3
    thinking_budget: medium   # llama.cpp: reasoning-budget=8192; vLLM: thinking_token_budget=8192

On MLX and SGLang backends, thinking_budget is silently ignored — those engines do not currently expose a per-request reasoning token budget knob.

vLLM-specific configuration

When running a model on the vLLM backend, additional engine-level settings can be passed via provider_opts and are forwarded to model-runner’s _configure endpoint:

models:
  vllm-local:
    provider: dmr
    model: ai/some-model-safetensors
    provider_opts:
      gpu_memory_utilization: 0.9
      hf_overrides:
        max_model_len: 8192
        dtype: bfloat16

hf_overrides keys (including nested ones) must match ^[a-zA-Z_][a-zA-Z0-9_]*$ — the same rule model-runner enforces server-side to block injection via flags. Invalid keys are rejected at client creation time so you fail fast instead of after a round-trip.
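
As a minimal sketch of that client-side check (hypothetical helper names; the real client's implementation may differ), each key at every nesting level is matched against the pattern before any request is built:

```python
import re

# Mirrors model-runner's server-side rule for hf_overrides keys.
KEY_RE = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")

def validate_hf_overrides(overrides, path=""):
    """Recursively check that every key is a safe identifier.

    Raises ValueError on the first invalid key, so a bad config
    fails at client creation time rather than after a round-trip.
    """
    for key, value in overrides.items():
        if not isinstance(key, str) or not KEY_RE.match(key):
            raise ValueError(f"invalid hf_overrides key: {path}{key!r}")
        if isinstance(value, dict):
            validate_hf_overrides(value, path=f"{path}{key}.")

# Valid: plain identifiers, including nested ones.
validate_hf_overrides({"max_model_len": 8192, "dtype": "bfloat16"})

# Invalid: the nested key contains a hyphen, so it is rejected.
try:
    validate_hf_overrides({"rope_scaling": {"rope-type": "linear"}})
except ValueError as e:
    print(e)
```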

These options are ignored on non-vLLM backends.

Keeping models resident in memory (keep_alive)

By default, model-runner unloads idle models after a few minutes. Override the idle timeout via provider_opts.keep_alive:

models:
  sticky:
    provider: dmr
    model: ai/qwen3
    provider_opts:
      keep_alive: "30m"   # duration string
      # keep_alive: "0"   # unload immediately after each request
      # keep_alive: "-1"  # keep loaded forever

Accepted values: any Go duration string ("30s", "5m", "1h", "2h30m"), "0" (immediate unload), or "-1" (never unload). Invalid values are rejected before the configure request is sent.

Operating mode (mode)

Model-runner normally infers the backend mode from the request path. You can pin it explicitly via provider_opts.mode:

provider_opts:
  mode: embedding   # one of: completion, embedding, reranking, image-generation

Most agents don’t need this — leave it unset unless you know you need it.

Raw runtime flags (raw_runtime_flags)

runtime_flags (a list) is the preferred way to pass flags. If you have a pre-built command-line string you’d rather ship verbatim, use raw_runtime_flags instead:

provider_opts:
  raw_runtime_flags: "--threads 8 --batch-size 512"

Model-runner parses the string with shell-style word splitting. runtime_flags and raw_runtime_flags are mutually exclusive — setting both is an error.

Speculative Decoding

Use a smaller draft model to predict tokens ahead for faster inference:

models:
  fast-local:
    provider: dmr
    model: ai/qwen3:14B
    max_tokens: 8192
    provider_opts:
      speculative_draft_model: ai/qwen3:0.6B-F16
      speculative_num_tokens: 16
      speculative_acceptance_rate: 0.8

Custom Endpoint

If base_url is omitted, docker-agent auto-discovers the DMR endpoint. To set it manually:

models:
  local:
    provider: dmr
    model: ai/qwen3
    base_url: http://127.0.0.1:12434/engines/llama.cpp/v1

Troubleshooting