Advanced Features
This guide covers advanced features and configurations in OneLLM, including fallback mechanisms, retry strategies, and working with multiple providers.
Fallback Mechanism
OneLLM provides automatic fallback between providers when failures occur, helping maintain availability and resilience in production environments.
Basic Fallback
Use multiple models in order of preference:
from onellm import ChatCompletion
# Fallback chain: try each model in order
response = ChatCompletion.create(
model="openai/gpt-4",
messages=[{"role": "user", "content": "Hello!"}],
fallback_models=[
"anthropic/claude-3-opus",
"google/gemini-pro"
]
)
Provider-Specific Fallback
Mix models from different providers:
from onellm import ChatCompletion
# Production-ready fallback strategy
response = ChatCompletion.create(
model="openai/gpt-4-turbo", # Primary model
messages=[{"role": "user", "content": "Analyze this data..."}],
fallback_models=[
"anthropic/claude-3-opus", # Fallback 1: Different provider
"openai/gpt-3.5-turbo", # Fallback 2: Same provider, cheaper
"ollama/llama2" # Fallback 3: Local model
]
)
Conditional Fallback
Fallback only occurs when specific error types happen (a hand-rolled sketch of this policy follows the list):
- Rate Limits: automatically switches to the next provider
- Service Unavailable: tries alternative providers
- Authentication Errors: skips straight to the next provider (retrying the same credentials won't help)
- Invalid Model: moves on to the next available model
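For illustration, the same policy can be approximated by hand on top of the public API. The sketch below assumes only the error classes used elsewhere in this guide (RateLimitError and the base OneLLMError from onellm.errors); the loop itself is illustrative and not how OneLLM implements fallback internally.
from onellm import ChatCompletion
from onellm.errors import OneLLMError, RateLimitError

models = ["openai/gpt-4", "anthropic/claude-3-opus", "google/gemini-pro"]
messages = [{"role": "user", "content": "Hello!"}]

response = None
for model in models:
    try:
        response = ChatCompletion.create(model=model, messages=messages)
        break             # success: stop trying further models
    except RateLimitError:
        continue          # transient rate limit: move to the next provider
    except OneLLMError:
        continue          # auth error, unavailable service, invalid model: not worth retrying, try the next one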
Fallback Events
Monitor fallback events:
import logging
from onellm import ChatCompletion
logging.basicConfig(level=logging.INFO)
# OneLLM logs fallback attempts
response = ChatCompletion.create(
model="openai/gpt-4",
messages=[{"role": "user", "content": "Hello!"}],
fallback_models=["anthropic/claude-3-opus"]
)
# Logs will show:
# INFO: Attempting openai/gpt-4...
# WARNING: openai/gpt-4 failed: Rate limit exceeded
# INFO: Falling back to anthropic/claude-3-opus...
# INFO: Successfully used anthropic/claude-3-opus
Retry Configuration
Control how OneLLM retries failed requests.
Max Retries
For transient errors, OneLLM can retry the same model multiple times before falling back:
from onellm import ChatCompletion
# Configure retries per request
response = ChatCompletion.create(
model="openai/gpt-4",
messages=[{"role": "user", "content": "Hello!"}],
retries=3, # Will try the same model up to 3 additional times if it fails
fallback_models=["anthropic/claude-3-haiku", "openai/gpt-3.5-turbo"]
)
Retry Delays
OneLLM uses exponential backoff with jitter:
# Retry delays (approximately):
# Attempt 1: Immediate
# Attempt 2: ~1 second
# Attempt 3: ~2 seconds
# Attempt 4: ~4 seconds
# Attempt 5: ~8 seconds
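The exact schedule is internal to OneLLM, but the pattern above is standard exponential backoff with jitter. A minimal sketch of the idea, with an assumed base delay of one second and a simple jitter factor (not OneLLM's actual constants):
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """First attempt is immediate; later attempts wait roughly base * 2^(n-1) seconds."""
    if attempt == 0:
        return 0.0
    delay = min(base * (2 ** (attempt - 1)), cap)
    return delay * random.uniform(0.5, 1.0)  # jitter spreads retries from concurrent clients apart

for attempt in range(5):
    print(f"Attempt {attempt + 1}: wait ~{backoff_delay(attempt):.1f}s")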
Custom Retry Logic
Implement custom retry behavior:
from onellm import ChatCompletion
from onellm.errors import RateLimitError
import time
def custom_retry(func, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt < max_attempts - 1:
                wait_time = min(2 ** attempt, 60)  # Exponential backoff, capped at 60s
                print(f"Rate limited, waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise

# Use with custom logic
response = custom_retry(
    lambda: ChatCompletion.create(
        model="openai/gpt-4",
        messages=[{"role": "user", "content": "Hello!"}]
    )
)
Fallback and Retry Architecture
Here’s how OneLLM handles failures, retries, and fallbacks internally:
---
config:
  look: handDrawn
  theme: mc
  themeVariables:
    background: 'transparent'
    primaryColor: '#fff0'
    secondaryColor: 'transparent'
    tertiaryColor: 'transparent'
    mainBkg: 'transparent'
  flowchart:
    layout: fixed
---
flowchart TD
START(["Client Request"]) --> REQUEST["Chat/Completion Request"]
REQUEST --> PRIMARY["Primary Model<br>e.g., openai/gpt-4"]
PRIMARY --> API_CHECK{"API<br>Available?"}
API_CHECK -->|Yes| MODEL_CHECK{"Model<br>Available?"}
MODEL_CHECK -->|Yes| QUOTA_CHECK{"Quota/Rate<br>Limits OK?"}
QUOTA_CHECK -->|Yes| SUCCESS["Successful Response"]
SUCCESS --> RESPONSE(["Return to Client"])
API_CHECK -->|No| RETRY_DECISION{"Retry<br>Count < Max?"}
MODEL_CHECK -->|No| RETRY_DECISION
QUOTA_CHECK -->|No| RETRY_DECISION
RETRY_DECISION -->|Yes| RETRY["Retry with Delay<br>(Same Model)"]
RETRY --> PRIMARY
RETRY_DECISION -->|No| FALLBACK_CHECK{"Fallbacks<br>Available?"}
FALLBACK_CHECK -->|Yes| FALLBACK_MODEL["Next Fallback Model<br>e.g., anthropic/claude-3-haiku"]
FALLBACK_MODEL --> FALLBACK_TRY["Try Fallback"]
FALLBACK_TRY --> FALLBACK_API_CHECK{"API<br>Available?"}
FALLBACK_API_CHECK -->|Yes| FALLBACK_SUCCESS["Successful Response"]
FALLBACK_SUCCESS --> RESPONSE
FALLBACK_API_CHECK -->|No| NEXT_FALLBACK{"More<br>Fallbacks?"}
NEXT_FALLBACK -->|Yes| FALLBACK_MODEL
NEXT_FALLBACK -->|No| ERROR["Error Response"]
FALLBACK_CHECK -->|No| ERROR
ERROR --> RESPONSE
This flow (sketched in simplified Python after the list) ensures maximum reliability by:
- Retrying transient failures with exponential backoff
- Falling back to alternative models when retries are exhausted
- Trying multiple fallback options in sequence
- Returning detailed error information when all options fail
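In code terms, the diagram boils down to a nested loop: spend the retry budget on the current model, then move to the next fallback. The sketch below is a simplified model of that control flow, not OneLLM's actual implementation; call_model and is_retryable are hypothetical stand-ins for the provider call and the error classification.
import time

def resolve_with_fallback(models, call_model, is_retryable, retries=3):
    """Simplified model of the flow above: retry the current model, then fall back."""
    last_error = None
    for model in models:                        # primary model first, then fallbacks in order
        for attempt in range(retries + 1):      # the retry budget applies per model
            try:
                return call_model(model)
            except Exception as e:
                last_error = e
                if is_retryable(e) and attempt < retries:
                    time.sleep(min(2 ** attempt, 60))  # exponential backoff between retries
                else:
                    break                       # non-retryable error or budget spent: next fallback
    raise last_error                            # every model, including fallbacks, failed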
Working with Local Models
OneLLM supports local models through Ollama and llama.cpp providers.
Using Ollama
Configure the Ollama endpoint:
from onellm import ChatCompletion
import os
# Set Ollama base URL if not using default
os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434"
# Use Ollama models
response = ChatCompletion.create(
model="ollama/llama2",
messages=[{"role": "user", "content": "Hello!"}]
)
Using llama.cpp
For direct GGUF model execution:
from onellm import ChatCompletion
import os
# Set path to your GGUF model
os.environ["LLAMA_CPP_MODEL_PATH"] = "/path/to/model.gguf"
# Use llama.cpp provider
response = ChatCompletion.create(
model="llama_cpp/model",
messages=[{"role": "user", "content": "Hello!"}]
)
Streaming with Fallback
Combine streaming with fallback for resilient real-time responses:
from onellm import ChatCompletion
stream = ChatCompletion.create(
model="openai/gpt-4",
messages=[{"role": "user", "content": "Write a story..."}],
fallback_models=["anthropic/claude-3-opus"],
stream=True
)
# If streaming fails mid-response, OneLLM can:
# 1. Continue with the partial response
# 2. Restart with fallback model
# 3. Merge responses intelligently
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
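If you want explicit control over what happens when a stream breaks partway through, you can also handle the restart yourself. The generator below is a manual approximation, assuming nothing beyond the documented create(..., stream=True) call; the helper name and the choice of fallback model are illustrative.
from onellm import ChatCompletion

def stream_with_manual_fallback(messages, primary, fallback):
    """Illustrative: restart on the fallback model if the primary stream raises mid-response."""
    try:
        for chunk in ChatCompletion.create(model=primary, messages=messages, stream=True):
            if chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content
    except Exception:
        # Start over on the fallback model; the caller receives the fallback's full answer.
        for chunk in ChatCompletion.create(model=fallback, messages=messages, stream=True):
            if chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

for text in stream_with_manual_fallback(
    [{"role": "user", "content": "Write a story..."}],
    "openai/gpt-4",
    "openai/gpt-3.5-turbo",
):
    print(text, end="")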
Request Timeout Configuration
Control timeouts through environment variables:
from onellm import ChatCompletion
import os
# Set timeout via environment variable
os.environ["ONELLM_TIMEOUT"] = "60" # 60 seconds
# Make request with configured timeout
response = ChatCompletion.create(
model="openai/gpt-4",
messages=[{"role": "user", "content": "Hello!"}]
)
Concurrent Requests
Handle multiple requests efficiently:
import asyncio
from onellm import ChatCompletion
async def process_many():
    # Process multiple requests concurrently
    tasks = []
    for i in range(10):
        task = ChatCompletion.acreate(
            model="openai/gpt-3.5-turbo",
            messages=[{"role": "user", "content": f"Request {i}"}],
            fallback_models=["anthropic/claude-3-haiku"]
        )
        tasks.append(task)
    # Wait for all to complete
    responses = await asyncio.gather(*tasks)
    return responses
# Run concurrent requests
responses = asyncio.run(process_many())
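Firing many requests at once can still trip provider rate limits. A common pattern is to cap the number of in-flight requests with a semaphore; this is plain asyncio rather than a OneLLM feature, and the limit of 3 is just an example:
import asyncio
from onellm import ChatCompletion

async def process_many_bounded(n=10, max_in_flight=3):
    semaphore = asyncio.Semaphore(max_in_flight)  # at most max_in_flight requests at a time

    async def one_request(i):
        async with semaphore:
            return await ChatCompletion.acreate(
                model="openai/gpt-3.5-turbo",
                messages=[{"role": "user", "content": f"Request {i}"}],
                fallback_models=["anthropic/claude-3-haiku"]
            )

    return await asyncio.gather(*(one_request(i) for i in range(n)))

responses = asyncio.run(process_many_bounded())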
JSON Mode
For structured outputs, OneLLM supports JSON mode with compatible providers:
from onellm import ChatCompletion
import json
response = ChatCompletion.create(
    model="openai/gpt-4o",
    messages=[
        {"role": "user", "content": "List the top 3 programming languages with their key features as a JSON object"}
    ],
    response_format={"type": "json_object"}  # Request JSON output; the prompt itself should also mention JSON
)
# The response contains valid, parseable JSON
json_response = response.choices[0].message["content"]
structured_data = json.loads(json_response)
Logging and Monitoring
Enable logging to monitor OneLLM operations:
import logging
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
# OneLLM will log:
# - Provider selection
# - Retry attempts
# - Fallback decisions
# - Errors and warnings
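To route these messages somewhere specific, such as a dedicated log file, attach a handler to the library's logger. This assumes OneLLM logs under the "onellm" logger name, as used in the Best Practices section below:
import logging

onellm_logger = logging.getLogger("onellm")   # assumed logger name
onellm_logger.setLevel(logging.INFO)

# Write retry and fallback messages to their own file
handler = logging.FileHandler("onellm.log")
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
onellm_logger.addHandler(handler)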
Environment Variables
OneLLM supports configuration through environment variables:
import os
# Set API keys
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."
# Set timeouts
os.environ["ONELLM_TIMEOUT"] = "60"
# Set Ollama endpoint
os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434"
# Set llama.cpp model path
os.environ["LLAMA_CPP_MODEL_PATH"] = "/path/to/model.gguf"
Best Practices
1. Always Use Fallbacks in Production
# Bad: Single point of failure
response = ChatCompletion.create(
model="openai/gpt-4",
messages=messages
)
# Good: Multiple fallback options
response = ChatCompletion.create(
model="openai/gpt-4",
messages=messages,
fallback_models=["anthropic/claude-3-opus", "openai/gpt-3.5-turbo"]
)
2. Configure Appropriate Timeouts
import os
# Set reasonable timeout via environment
os.environ["ONELLM_TIMEOUT"] = "30" # 30 seconds
# Make requests with configured timeout
response = ChatCompletion.create(
model="openai/gpt-3.5-turbo",
messages=messages
)
3. Monitor and Log Failures
import logging
# Enable logging to track issues
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("onellm")
# Logs will show fallback attempts and errors
4. Use Local Models for Development
# Development: Use local models
dev_model = "ollama/llama2"
# Production: Use cloud providers with fallbacks
prod_model = "openai/gpt-4"
prod_fallbacks = ["anthropic/claude-3-opus", "openai/gpt-3.5-turbo"]
5. Handle Errors Gracefully
from onellm import ChatCompletion
from onellm.errors import OneLLMError
try:
    response = ChatCompletion.create(
        model="openai/gpt-4",
        messages=messages,
        fallback_models=["anthropic/claude-3-opus"]
    )
except OneLLMError as e:
    # Handle OneLLM-specific errors
    logger.error(f"LLM request failed: {e}")
    # Use a default response or retry later
Next Steps
- Provider Capabilities - Compare provider features
- Error Handling - Handle errors gracefully
- Best Practices - Production recommendations