# Local Models (Ollama, vLLM, LocalAI)
Run cagent with locally hosted models for privacy, offline use, or cost savings.
## Overview

cagent can connect to any OpenAI-compatible local model server. This guide covers the most popular options:

- **Ollama** — Easy-to-use local model runner
- **vLLM** — High-performance inference server
- **LocalAI** — OpenAI-compatible API for various backends

For the easiest local model experience, consider Docker Model Runner, which is built into Docker Desktop and requires no additional setup.
## Ollama

Ollama is a popular tool for running LLMs locally. cagent includes a built-in `ollama` alias for easy configuration.
### Setup

1. Install Ollama from [ollama.ai](https://ollama.ai)

2. Pull a model:

   ```shell
   ollama pull llama3.2
   ollama pull qwen2.5-coder
   ```

3. Start the Ollama server (it usually runs automatically):

   ```shell
   ollama serve
   ```
### Configuration

Use the built-in `ollama` alias:

```yaml
agents:
  root:
    model: ollama/llama3.2
    description: Local assistant
    instruction: You are a helpful assistant.
```
The `ollama` alias automatically uses:

- Base URL: `http://localhost:11434/v1`
- API type: OpenAI-compatible
- No API key required
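Any OpenAI-compatible client can talk to this endpoint the same way cagent does. Below is a minimal sketch of the chat-completions request implied by the alias settings above; the request is only constructed, not sent, since sending it requires a running Ollama server:

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint

# Build an OpenAI-style chat-completions request for the pulled model.
payload = {
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
}
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},  # no Authorization header needed
)
# With Ollama running, send it with: urllib.request.urlopen(req)
```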
### Custom Port or Host

If Ollama runs on a different host or port, define the model explicitly:

```yaml
models:
  my_ollama:
    provider: ollama
    model: llama3.2
    base_url: http://192.168.1.100:11434/v1

agents:
  root:
    model: my_ollama
    description: Remote Ollama assistant
    instruction: You are a helpful assistant.
```
### Popular Ollama Models

| Model | Size | Best For |
|---|---|---|
| `llama3.2` | 3B | General purpose, fast |
| `llama3.1` | 8B | Better reasoning |
| `qwen2.5-coder` | 7B | Code generation |
| `mistral` | 7B | General purpose |
| `codellama` | 7B | Code tasks |
| `deepseek-coder` | 6.7B | Code generation |
## vLLM

vLLM is a high-performance inference server optimized for throughput.

### Setup

```shell
# Install vLLM
pip install vllm

# Start the server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-3B-Instruct \
  --port 8000
```
### Configuration

```yaml
providers:
  vllm:
    api_type: openai_chatcompletions
    base_url: http://localhost:8000/v1

agents:
  root:
    model: vllm/meta-llama/Llama-3.2-3B-Instruct
    description: vLLM-powered assistant
    instruction: You are a helpful assistant.
```
## LocalAI

LocalAI provides an OpenAI-compatible API that works with various backends.

### Setup

```shell
# Run with Docker
docker run -p 8080:8080 --name local-ai \
  -v ./models:/models \
  localai/localai:latest-cpu
```
### Configuration

```yaml
providers:
  localai:
    api_type: openai_chatcompletions
    base_url: http://localhost:8080/v1

agents:
  root:
    model: localai/gpt4all-j
    description: LocalAI assistant
    instruction: You are a helpful assistant.
```
## Generic Custom Provider

For any other OpenAI-compatible server, define a custom provider:

```yaml
providers:
  my_server:
    api_type: openai_chatcompletions
    base_url: http://localhost:8000/v1
    # token_key: MY_API_KEY  # if auth is required

agents:
  root:
    model: my_server/model-name
    description: Custom server assistant
    instruction: You are a helpful assistant.
```
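Assuming `token_key` names an environment variable holding the token (as the commented `MY_API_KEY` example suggests), an OpenAI-compatible server typically expects it as a standard Bearer header. A hypothetical sketch of how such a client would attach it:

```python
import os

# For illustration only: in real use, MY_API_KEY would already be set
# in your environment, matching the token_key name in the config above.
os.environ.setdefault("MY_API_KEY", "sk-local-example")
token = os.environ["MY_API_KEY"]

# Headers an OpenAI-compatible client would send to an authenticated server.
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {token}",  # standard OpenAI-style auth header
}
```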
## Performance Tips
## Example: Offline Development Agent
```yaml
agents:
  developer:
    model: ollama/qwen2.5-coder
    description: Offline code assistant
    instruction: |
      You are a software developer working offline.
      Focus on code quality and clear explanations.
    max_iterations: 20
    toolsets:
      - type: filesystem
      - type: shell
      - type: think
      - type: todo
```
## Troubleshooting

### Connection Refused

Ensure your model server is running and accessible:

```shell
curl http://localhost:11434/v1/models  # Ollama
curl http://localhost:8000/v1/models   # vLLM
```
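If you prefer a scriptable check, the same probe can be written as a small helper. This is a generic sketch (not part of cagent) that distinguishes "server answered" from "nothing is listening":

```python
import urllib.request
import urllib.error

def server_reachable(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if an OpenAI-compatible /models endpoint answers at all."""
    try:
        urllib.request.urlopen(f"{base_url}/models", timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # the server responded, even if with an error status
    except (urllib.error.URLError, OSError):
        return False  # connection refused, DNS failure, timeout, ...

print(server_reachable("http://localhost:11434/v1"))  # Ollama
print(server_reachable("http://localhost:8000/v1"))   # vLLM
```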
### Model Not Found

Verify the model is downloaded and available:

```shell
ollama list  # List available Ollama models
```
### Slow Responses

- Check that GPU acceleration is enabled
- Try a smaller model
- Reduce `max_tokens` in your config