Local Models (Ollama, vLLM, LocalAI)

Run cagent with locally hosted models for privacy, offline use, or cost savings.

Overview

cagent can connect to any OpenAI-compatible local model server. This guide covers the most popular options:

💡 Docker Model Runner

For the easiest local model experience, consider Docker Model Runner, which is built into Docker Desktop and requires no additional setup.

Ollama

Ollama is a popular tool for running LLMs locally. cagent includes a built-in ollama alias for easy configuration.

Setup

  1. Install Ollama from ollama.ai
  2. Pull a model:

    ollama pull llama3.2
    ollama pull qwen2.5-coder
    
  3. Start the Ollama server (usually runs automatically):

    ollama serve
    

Configuration

Use the built-in ollama alias:

agents:
  root:
    model: ollama/llama3.2
    description: Local assistant
    instruction: You are a helpful assistant.

The ollama alias automatically uses the default local Ollama endpoint (http://localhost:11434/v1), so no base_url or API key configuration is needed.

Custom Port or Host

If Ollama runs on a different host or port:

models:
  my_ollama:
    provider: ollama
    model: llama3.2
    base_url: http://192.168.1.100:11434/v1

agents:
  root:
    model: my_ollama
    description: Remote Ollama assistant
    instruction: You are a helpful assistant.

Recommended Models

| Model | Size | Best For |
| --- | --- | --- |
| llama3.2 | 3B | General purpose, fast |
| llama3.1 | 8B | Better reasoning |
| qwen2.5-coder | 7B | Code generation |
| mistral | 7B | General purpose |
| codellama | 7B | Code tasks |
| deepseek-coder | 6.7B | Code generation |
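Ollama model names accept an optional size tag, so you can pin a specific variant from the table above. A sketch of such a config (the `:8b` tag is illustrative and assumes you have pulled that variant):

```yaml
models:
  reasoning:
    provider: ollama
    model: llama3.1:8b   # pinned size variant; pull it first with `ollama pull llama3.1:8b`

agents:
  root:
    model: reasoning
    description: Local assistant with stronger reasoning
    instruction: You are a helpful assistant.
```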

vLLM

vLLM is a high-performance inference server optimized for throughput.

Setup

# Install vLLM
pip install vllm

# Start the server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --port 8000

Configuration

providers:
  vllm:
    api_type: openai_chatcompletions
    base_url: http://localhost:8000/v1

agents:
  root:
    model: vllm/meta-llama/Llama-3.2-3B-Instruct
    description: vLLM-powered assistant
    instruction: You are a helpful assistant.
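Note that the model reference above contains two slashes: the provider alias, then the org-prefixed model id that vLLM serves. Assuming the reference is split at the first slash only (an assumption about the parsing, sketched here for illustration rather than taken from cagent's source), the org prefix survives intact:

```python
# Sketch: split a "provider/model" reference at the first slash only,
# so org-prefixed model ids like meta-llama/... stay intact.
def split_model_ref(ref: str) -> tuple[str, str]:
    provider, _, model = ref.partition("/")
    return provider, model

provider, model = split_model_ref("vllm/meta-llama/Llama-3.2-3B-Instruct")
print(provider)  # vllm
print(model)     # meta-llama/Llama-3.2-3B-Instruct
```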

LocalAI

LocalAI provides an OpenAI-compatible API that works with various backends.

Setup

# Run with Docker
docker run -p 8080:8080 --name local-ai \
  -v ./models:/models \
  localai/localai:latest-cpu

Configuration

providers:
  localai:
    api_type: openai_chatcompletions
    base_url: http://localhost:8080/v1

agents:
  root:
    model: localai/gpt4all-j
    description: LocalAI assistant
    instruction: You are a helpful assistant.

Generic Custom Provider

For any OpenAI-compatible server:

providers:
  my_server:
    api_type: openai_chatcompletions
    base_url: http://localhost:8000/v1
    # token_key: MY_API_KEY  # if auth required

agents:
  root:
    model: my_server/model-name
    description: Custom server assistant
    instruction: You are a helpful assistant.
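Under the hood, an OpenAI-compatible server is just an HTTP endpoint that accepts the standard chat-completions JSON body. A minimal sketch of that request format (field names follow the OpenAI Chat Completions API; the model id is illustrative):

```python
import json

# Build the standard OpenAI-style chat completions payload that any
# OpenAI-compatible server (vLLM, LocalAI, Ollama's /v1 API) accepts
# at POST <base_url>/chat/completions.
payload = {
    "model": "model-name",  # illustrative model id
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
}

body = json.dumps(payload)
print(body)
```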

Performance Tips

ℹ️ Local Model Considerations
- **Memory:** Larger models need more RAM/VRAM. A 7B model typically needs 8-16GB RAM.
- **GPU:** GPU acceleration dramatically improves speed. Check your server's GPU support.
- **Context length:** Local models often have smaller context windows than cloud models.
- **Tool calling:** Not all local models support function/tool calling. Test your model's capabilities.

Example: Offline Development Agent

agents:
  developer:
    model: ollama/qwen2.5-coder
    description: Offline code assistant
    instruction: |
      You are a software developer working offline.
      Focus on code quality and clear explanations.
    max_iterations: 20
    toolsets:
      - type: filesystem
      - type: shell
      - type: think
      - type: todo

Troubleshooting

Connection Refused

Ensure your model server is running and accessible:

curl http://localhost:11434/v1/models  # Ollama
curl http://localhost:8000/v1/models   # vLLM
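A healthy server answers /v1/models with an OpenAI-style model list. This sketch parses a sample response (the JSON below is a hand-written example of the standard format, not captured output) to pull out the model ids:

```python
import json

# Hand-written sample of the OpenAI-style /v1/models response format.
sample = """
{"object": "list",
 "data": [{"id": "llama3.2", "object": "model"},
          {"id": "qwen2.5-coder", "object": "model"}]}
"""

# Each entry's "id" is the name you reference in your agent config.
ids = [m["id"] for m in json.loads(sample)["data"]]
print(ids)  # ['llama3.2', 'qwen2.5-coder']
```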

Model Not Found

Verify the model is downloaded/available:

ollama list  # List available Ollama models

Slow Responses

If generation is slow, try a smaller model from the table above, reduce the prompt/context size, or confirm that GPU acceleration is actually in use on your server.