# Ollama Provider

The Ollama provider enables OneLLM to work with locally running Ollama servers and supports dynamic endpoint routing, so a single client can spread requests across multiple Ollama instances.
## Installation

- Install Ollama from [ollama.ai](https://ollama.ai)
- Start the Ollama server:
  ```bash
  ollama serve
  ```
- Pull the models you want to use:
  ```bash
  ollama pull llama3:8b
  ollama pull mistral:7b
  ollama pull llava:latest   # for vision support
  ```
## Configuration

### Environment Variables

- `OLLAMA_API_BASE` - Default Ollama server URL (default: `http://localhost:11434`)
- `OLLAMA_TIMEOUT` - Request timeout in seconds (default: 120)
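These variables can also be set from Python before the provider is used. The snippet below is a minimal sketch and assumes OneLLM reads them when the Ollama provider is initialized:

```python
import os

# Assumption: OneLLM picks these up when the Ollama provider is created
os.environ["OLLAMA_API_BASE"] = "http://gpu-server:11434"  # default server URL
os.environ["OLLAMA_TIMEOUT"] = "300"                       # request timeout in seconds
```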
### Programmatic Configuration

```python
import onellm

# Set the default Ollama server for all requests
onellm.update_provider_config("ollama", api_base="http://gpu-server:11434")
```
## Model Naming Format

Ollama supports dynamic endpoint routing using the format:

```text
ollama/model:tag@host:port
```

Examples:

- `ollama/llama3:8b` - uses the default server (`localhost:11434`)
- `ollama/llama3:8b@gpu-server:11434` - uses a specific server
- `ollama/mixtral:8x7b-instruct-q4_K_M@10.0.0.5:11434` - uses an IP address
- `ollama/llava:latest@https://secure-server:11434` - uses HTTPS
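Conceptually, the model string splits on the first `@`: the part before it is the provider-prefixed model name and the part after it is the endpoint. The helper below only illustrates that convention (including the assumption that a bare `host:port` defaults to HTTP); it is not OneLLM's internal routing code:

```python
def split_ollama_model(model: str, default_base: str = "http://localhost:11434"):
    """Illustrative split of 'ollama/model:tag[@host[:port]]' into (model_name, api_base)."""
    provider, _, rest = model.partition("/")      # e.g. "ollama", "llama3:8b@gpu-server:11434"
    if provider != "ollama":
        raise ValueError(f"not an Ollama model string: {model!r}")
    name, sep, endpoint = rest.partition("@")
    if not sep:                                    # no "@" -> default server
        return name, default_base
    if not endpoint.startswith(("http://", "https://")):
        endpoint = "http://" + endpoint            # assume HTTP for a bare host:port
    return name, endpoint

print(split_ollama_model("ollama/llama3:8b"))
# ('llama3:8b', 'http://localhost:11434')
print(split_ollama_model("ollama/llava:latest@https://secure-server:11434"))
# ('llava:latest', 'https://secure-server:11434')
```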
## Usage Examples

### Basic Usage

```python
from onellm import Client

client = Client()

# Use the default localhost server (call from within an async function)
response = await client.chat.completions.create(
    model="ollama/llama3:8b",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
### Multiple Servers

```python
# Use different servers for different models
client = Client()

# Fast local model
local_response = await client.chat.completions.create(
    model="ollama/llama3:8b",
    messages=[{"role": "user", "content": "Quick question"}]
)

# Powerful remote model
remote_response = await client.chat.completions.create(
    model="ollama/mixtral:8x7b@gpu-server:11434",
    messages=[{"role": "user", "content": "Complex analysis"}]
)

# Specialized model on a different server
special_response = await client.chat.completions.create(
    model="ollama/codellama:34b@code-server:11434",
    messages=[{"role": "user", "content": "Write a Python function"}]
)
```
### Streaming

```python
stream = await client.chat.completions.create(
    model="ollama/llama3:8b",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

async for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
### Vision Models

```python
# Using LLaVA for image analysis
response = await client.chat.completions.create(
    model="ollama/llava:latest",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64,..."}
            }
        ]
    }]
)
```
### Ollama-Specific Parameters

```python
response = await client.chat.completions.create(
    model="ollama/llama3:8b",
    messages=[{"role": "user", "content": "Hello"}],
    # Ollama-specific parameters
    num_gpu=1,       # GPU layers to use
    num_thread=8,    # CPU threads
    num_ctx=4096,    # context window size
    temperature=0.7,
    top_k=40,
    top_p=0.9
)
```
### List Available Models

```python
from onellm.providers import get_provider

ollama = get_provider("ollama")

# List models on the default server
models = await ollama.list_models()
print("Local models:", models)

# List models on a remote server
remote_models = await ollama.list_models("http://gpu-server:11434")
print("Remote models:", remote_models)
```
## Supported Models

### Text Generation Models

- Llama 3 family: `llama3:8b`, `llama3:70b`
- Mistral family: `mistral:7b`, `mixtral:8x7b`, `mixtral:8x22b`
- CodeLlama: `codellama:7b`, `codellama:13b`, `codellama:34b`
- Phi-3: `phi3:mini`, `phi3:medium`
- Gemma: `gemma:2b`, `gemma:7b`
- And many more…
### Vision Models

- LLaVA: `llava:latest`, `llava:34b`
- BakLLaVA: `bakllava:latest`
- LLaVA-Llama3: `llava-llama3:latest`
- LLaVA-Phi3: `llava-phi3:latest`
- Moondream: `moondream:latest`
- MiniCPM-V: `minicpm-v:latest`
- Llama 3.2 Vision: `llama3.2-vision:11b`
## Features

### Supported
- ✅ Chat completions
- ✅ Streaming responses
- ✅ Vision/multimodal (model-dependent)
- ✅ Multiple server endpoints
- ✅ Model listing
- ✅ Custom parameters
### Not Supported
- ❌ Function calling (model-dependent)
- ❌ Embeddings (use specialized models)
- ❌ Audio processing
- ❌ File uploads
## Performance Tips

- Local vs Remote: Use local models for low latency and remote servers for GPU power
- Model Selection: Choose model sizes appropriate for your hardware
- Quantization: Use quantized models (e.g., `q4_K_M`) for better performance
- GPU Acceleration: Configure `num_gpu` on GPU-enabled systems
- Context Size: Adjust `num_ctx` to balance your needs against available memory (a combined example follows this list)
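Putting the last three tips together, a request might pair a quantized model tag with explicit `num_gpu`, `num_thread`, and `num_ctx` values. The numbers below are placeholders to tune for your own hardware:

```python
response = await client.chat.completions.create(
    model="ollama/llama3:8b-instruct-q4_K_M",  # quantized variant
    messages=[{"role": "user", "content": "Hello"}],
    num_gpu=20,      # layers offloaded to the GPU (0 = CPU only)
    num_thread=8,    # CPU threads for the remaining layers
    num_ctx=8192     # larger context windows use more memory
)
```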
## Common Issues

### Ollama Server Not Running

```text
Error: Cannot connect to Ollama server at http://localhost:11434
```

Solution: Start Ollama with `ollama serve`.

### Model Not Found

```text
Error: Model 'llama3:8b' not found on http://localhost:11434
```

Solution: Pull the model with `ollama pull llama3:8b`.
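Both problems can be detected up front by querying the Ollama server's `/api/tags` endpoint, which lists the models it has pulled. The pre-flight check below uses `httpx` and is a convenience sketch, not part of OneLLM:

```python
import httpx

async def ollama_ready(base_url: str = "http://localhost:11434",
                       model: str | None = None) -> bool:
    """Return True if the server responds and (optionally) has `model` pulled."""
    try:
        async with httpx.AsyncClient(timeout=5) as http:
            resp = await http.get(f"{base_url}/api/tags")
            resp.raise_for_status()
    except httpx.HTTPError:
        return False  # server not running or unreachable
    if model is None:
        return True
    names = [m.get("name") for m in resp.json().get("models", [])]
    return model in names  # e.g. "llama3:8b"
```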
### Timeout Errors

Large models may take time to load. Increase the timeout:

```python
import onellm

onellm.update_provider_config("ollama", timeout=300)
```
### Memory Issues

Reduce GPU layers or use smaller/quantized models:

```python
response = await client.chat.completions.create(
    model="ollama/llama3:8b-instruct-q4_K_M",  # quantized model
    messages=[{"role": "user", "content": "Hello"}],
    num_gpu=0  # CPU only
)
```
## Advanced Usage

### Load Balancing

```python
import random

# Define server pool
servers = [
    "server1:11434",
    "server2:11434",
    "server3:11434"
]

# Random server selection
server = random.choice(servers)
model = f"ollama/llama3:8b@{server}"

response = await client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Hello"}]
)
```
### Model Routing by Task

```python
# Route models based on task type
task_models = {
    "code": "ollama/codellama:34b@code-server:11434",
    "chat": "ollama/llama3:8b@localhost:11434",
    "analysis": "ollama/mixtral:8x7b@gpu-server:11434",
    "vision": "ollama/llava:34b@vision-server:11434"
}

task_type = "code"  # determined by your application
messages = [{"role": "user", "content": "Write a Python function"}]

model = task_models.get(task_type, task_models["chat"])
response = await client.chat.completions.create(
    model=model,
    messages=messages
)
```