Semantic Caching
OneLLM includes an intelligent semantic caching layer that reduces API costs by 50-80% and improves response times for repeated or similar queries.
Overview
The cache uses a hybrid two-tier approach with automatic expiration and streaming support:
- Hash-based exact matching - Instant cache hits (~3.5µs) for identical queries - 42,000-143,000x faster
- Semantic similarity matching - Fast similarity search (~18ms) for near-duplicate queries - 10-30x faster
- TTL auto-expiration - Entries expire after a configurable time (default: 1 day), with refresh-on-access
- Streaming simulation - Cached responses are chunked naturally to preserve the streaming UX
Key Benefits:
- 💰 Cost savings - Reduces API costs by 50-80% for workloads with heavily repeated queries
- ⚡ Blazing fast - 42,000-143,000x speedup for exact matches, 10-30x for semantic
- 📺 Streaming support - Both streaming and non-streaming requests benefit from the cache
- ⏱️ Auto-expiration - TTL prevents stale data, with refresh-on-access
- 🌍 Multilingual support - Works with 50+ languages
- 💵 Zero ongoing costs - Uses local embeddings, no API calls
- 🔒 Privacy-focused - All processing happens locally
Quick Start
import onellm
from onellm import ChatCompletion
# Enable cache once at startup
onellm.init_cache()
# Use OneLLM normally - responses are cached automatically
response = ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "What is Python?"}]
)
# First call: ~2000ms (API call + cached)

response = ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "What is Python?"}]
)
# Second call: <1ms (hash cache hit - exact match)

response = ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Tell me about the Python programming language"}]
)
# Third call: ~18ms (semantic cache hit - 95%+ similar)
How It Works
Two-Tier Architecture
┌──────────────────────────────────────┐
│             User Request             │
└───────────────┬──────────────────────┘
                │
                ▼
┌──────────────────────────────────────┐
│  1. Hash Lookup (~2µs)               │
│     • SHA256 of request              │
│     • OrderedDict (LRU)              │
│     • Exact match only               │
└───────────────┬──────────────────────┘
                │
         ┌──────┴──────┐
         │    Hit?     │
         └──────┬──────┘
       Yes      │      No
        │               │
        │               ▼
        │   ┌──────────────────────────────┐
        │   │  2. Semantic Search (~18ms)  │
        │   │     • Extract text content   │
        │   │     • Generate embedding     │
        │   │     • FAISS similarity       │
        │   │     • Threshold: 0.95        │
        │   └───────────┬──────────────────┘
        │               │
        │        ┌──────┴──────┐
        │        │ Similarity  │
        │        │   > 0.95?   │
        │        └──────┬──────┘
        │      Yes      │      No
        │       │               │
        ▼       ▼               ▼
┌─────────────────────┐   ┌──────────────┐
│   Return cached     │   │   API Call   │
│     response        │   │   + Cache    │
└─────────────────────┘   └──────────────┘
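For intuition, the flow above boils down to the self-contained sketch below. It is a toy illustration, not OneLLM's actual internals (the real cache uses FAISS and also handles TTL, streaming, and key normalization); embed is any callable that turns text into a vector.

import hashlib
import json
from collections import OrderedDict

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

class TwoTierCache:
    """Toy two-tier cache: exact SHA256 lookup first, embedding similarity second."""

    def __init__(self, embed, threshold=0.95, max_entries=1000):
        self.exact = OrderedDict()   # tier 1: request hash -> response (LRU order)
        self.semantic = []           # tier 2: (embedding, response) pairs
        self.embed = embed           # callable: text -> vector
        self.threshold = threshold
        self.max_entries = max_entries

    @staticmethod
    def _key(request):
        return hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()

    def get(self, request, text):
        key = self._key(request)
        if key in self.exact:                    # tier 1: exact hit (microseconds)
            self.exact.move_to_end(key)
            return self.exact[key]
        query = self.embed(text)                 # tier 2: similarity search (milliseconds)
        for vector, response in self.semantic:
            if _cosine(query, vector) >= self.threshold:
                return response
        return None                              # miss: call the API, then put()

    def put(self, request, text, response):
        self.exact[self._key(request)] = response
        self.semantic.append((self.embed(text), response))
        while len(self.exact) > self.max_entries:
            self.exact.popitem(last=False)       # LRU eviction of oldest entry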
Cache Key Generation
Included in cache key (must match):
- model - Model identifier
- messages - Conversation history
- temperature - Sampling temperature
- max_tokens - Response length limit
- response_format - JSON mode, etc.
- All other generation parameters
Excluded from cache key (ignored):
- stream - Streaming flag (cached responses can be returned as streams)
- timeout - Request timeout
- metadata - Custom metadata fields
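To make the included/excluded split concrete, here is a hypothetical key-construction sketch (illustration only; the EXCLUDED set and cache_key function are not part of OneLLM's API):

import hashlib
import json

EXCLUDED = {"stream", "timeout", "metadata"}  # ignored when building the key

def cache_key(model, messages, **params):
    payload = {
        "model": model,
        "messages": messages,
        **{k: v for k, v in params.items() if k not in EXCLUDED},
    }
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

# A streaming and a non-streaming request for the same prompt share one key:
k1 = cache_key("openai/gpt-4", [{"role": "user", "content": "Hi"}], temperature=0.7, stream=True)
k2 = cache_key("openai/gpt-4", [{"role": "user", "content": "Hi"}], temperature=0.7, stream=False)
assert k1 == k2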
Embedding Model
OneLLM uses paraphrase-multilingual-MiniLM-L12-v2:
- Size: 118MB (one-time download)
- Languages: 50+ (English, Spanish, French, German, Chinese, etc.)
- Dimensions: 384D
- Speed: ~18ms per query on CPU
- Matching: 0.95 default similarity threshold (configurable)
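You can get a feel for what a 0.95 threshold means by scoring two paraphrases with the same sentence-transformers model directly. This is a standalone check, separate from OneLLM's own cache path:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
a = model.encode("What is Python?", normalize_embeddings=True)
b = model.encode("Tell me about the Python programming language", normalize_embeddings=True)
print(util.cos_sim(a, b).item())  # scores at or above the configured threshold would count as a semantic hit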
Configuration
Basic Configuration
import onellm
# Default settings (recommended)
onellm.init_cache()
# Full configuration options
onellm.init_cache(
    max_entries=1000,               # LRU eviction limit (default: 1000)
    p=0.95,                         # Similarity threshold (default: 0.95)
    hash_only=False,                # Disable semantic matching (default: False)
    stream_chunk_strategy="words",  # Chunking: words|sentences|paragraphs|characters
    stream_chunk_length=8,          # Chunk size (default: 8)
    ttl=86400                       # Time-to-live in seconds (default: 86400 = 1 day)
)
# More aggressive matching (catches more similar queries)
onellm.init_cache(p=0.90) # p is shorthand for similarity_threshold
# Less aggressive (only very similar queries)
onellm.init_cache(p=0.98)
# Larger cache for long-running applications
onellm.init_cache(max_entries=5000) # Default: 1000
# Shorter TTL for frequently changing data
onellm.init_cache(ttl=3600) # 1 hour (default: 86400 = 1 day)
# Configure streaming chunk behavior
onellm.init_cache(
    stream_chunk_strategy="sentences",  # words|sentences|paragraphs|characters
    stream_chunk_length=2               # 2 sentences per chunk
)
# Combine options for production
onellm.init_cache(p=0.92, max_entries=10000, ttl=7200)
Advanced Configuration
# Hash-only mode (skip semantic model load, exact matches only)
onellm.init_cache(hash_only=True)
# Custom TTL with refresh-on-access
onellm.init_cache(ttl=3600) # Entries expire after 1 hour of no access
# Note: Accessing an entry refreshes its TTL
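Conceptually, refresh-on-access works like the small sketch below (a toy illustration, not OneLLM's internal data structure): reading a live entry resets its expiry clock.

import time

class TTLEntry:
    """Toy TTL wrapper with refresh-on-access semantics."""

    def __init__(self, value, ttl=86400):
        self.value = value
        self.ttl = ttl
        self.touched = time.monotonic()

    def get(self):
        now = time.monotonic()
        if now - self.touched > self.ttl:
            return None          # expired: treated as a cache miss
        self.touched = now       # refresh-on-access: lifetime extended
        return self.value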
Cache Management
# Get statistics
stats = onellm.cache_stats()
print(stats)
# {'hits': 15, 'misses': 5, 'entries': 10}
# Calculate hit rate
hit_rate = stats['hits'] / (stats['hits'] + stats['misses'])
print(f"Hit rate: {hit_rate:.1%}") # 75.0%
# Clear all cached entries
onellm.clear_cache()
# Disable cache
onellm.disable_cache()
# Re-enable with different settings
onellm.init_cache(p=0.85)
Use Cases
✅ Ideal For
Long-running processes:
- Web applications (Flask, FastAPI, Django)
- API services and microservices
- Background workers and daemons
- Development servers
- Jupyter notebooks
- Testing suites (the cache persists across tests in the same process)
Scenarios:
- Development and testing (repeated similar queries)
- Production with high query duplication
- Applications with common user questions
- Chatbots with FAQ-style queries
⚠️ Limited Benefit For
Short-lived processes:
- One-off scripts that exit immediately
- CLI tools that run and exit
- Batch jobs that restart frequently
Reason: The cache is memory-only and doesn't persist across process restarts. Each run starts with an empty cache and pays the ~13s model load again.
Performance
Benchmarks
Operation               Latency     Speedup vs API      Cost
──────────────────────────────────────────────────────────────
GPT-4 API call          150-500ms   1x (baseline)       $0.015
Hash cache hit          3.5µs       42,000-143,000x     $0
Semantic cache hit      ~18ms       10-30x              $0
Streaming simulation    3.5µs+      instant return      $0
Model load (one-time)   ~13s        -                   $0

Cache overhead on miss: ~3µs (<0.001% of request time)
Memory per entry: ~1-2KB
Model size: 118MB
TTL expiration: Automatic with refresh-on-access
Key Insights:
- Even semantic matching (~18ms) is 10-30x faster than API calls
- Cache overhead is essentially zero compared to API latency
- Streaming responses are cached and simulated naturally
- TTL auto-expiration prevents stale data accumulation
Typical Savings
Development:
- Cache hit rate: 60-80%
- Cost reduction: 60-80%
- Time saved: 60-80% of API latency
Production (with similar queries):
- Cache hit rate: 20-40%
- Cost reduction: 20-40%
- Time saved: 20-40% of API latency
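To translate a hit rate into money, you can combine cache_stats() with your own per-request cost. The cost_per_call figure below is an assumption (taken from the GPT-4 number in the benchmark table); substitute your provider's actual pricing:

import onellm

stats = onellm.cache_stats()
total = stats['hits'] + stats['misses']
hit_rate = stats['hits'] / total if total else 0.0
cost_per_call = 0.015  # assumed cost per avoided API call
print(f"Hit rate: {hit_rate:.1%}, estimated savings: ${stats['hits'] * cost_per_call:.2f}")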
Limitations
Memory-Only (MVP)
The cache does not persist across application restarts:
- Cache is stored in RAM only
- Each process restart starts with empty cache
- No file-based or database persistence
Future: Persistence (SQLite backend) may be added in v1.1 if requested.
Streaming Support with Natural Chunking
Streaming responses are fully cached and simulated naturally:
# Streaming requests check cache and simulate streaming from cached responses
for chunk in ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "What is Python?"}],
    stream=True
):
    print(chunk.choices[0].delta.content, end="", flush=True)
# If cached: returns instantly, chunks naturally to preserve streaming UX
# If not cached: makes API call, streams real-time, and caches for next time
How it works:
- Cache hit: Complete response is chunked and yielded naturally (feels like streaming)
- Cache miss: Real API streaming response is accumulated and cached
- Cost savings: Even streaming requests benefit from cache
- UX preserved: Users still see natural streaming behavior
Chunking strategies:
- words (default): 8 words per chunk - natural for general text
- sentences: 8 sentences per chunk - good for structured content
- paragraphs: 8 paragraphs per chunk - for longer-form content
- characters: 8 characters per chunk - precise control
Configure chunking:
onellm.init_cache(
    stream_chunk_strategy="sentences",  # or words, paragraphs, characters
    stream_chunk_length=2               # units per yielded chunk
)
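For intuition, the default words strategy behaves roughly like this standalone sketch (chunk_words is a hypothetical helper written for illustration, not OneLLM's API):

def chunk_words(text, length=8):
    """Yield the text in groups of `length` words, mimicking a stream of deltas."""
    words = text.split()
    for i in range(0, len(words), length):
        yield " ".join(words[i:i + length]) + " "

cached = "Python is a high-level, general-purpose programming language known for its readability and large ecosystem."
for piece in chunk_words(cached):
    print(piece, end="", flush=True)  # arrives in word-sized bursts, like a live stream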
Thread Safety
Basic thread safety is provided by Python's GIL:
- Simple dict operations are thread-safe
- No explicit locks or synchronization
- Suitable for most applications
Future: Explicit thread-safety guarantees may be added in v1.2+ if issues are reported.
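If your application needs stronger guarantees today, you can serialize cached calls at the application level. The wrapper below is one possible pattern, not something OneLLM provides:

import threading
import onellm
from onellm import ChatCompletion

onellm.init_cache()
_cache_lock = threading.Lock()

def safe_completion(**kwargs):
    # Serialize cached completions across threads in this process
    with _cache_lock:
        return ChatCompletion.create(**kwargs)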
Examples
Development Workflow
import onellm
from onellm import ChatCompletion
# Initialize once at startup
onellm.init_cache()
# During development, repeatedly test similar prompts
# During development, repeatedly test similar prompts
for prompt in ["What is Python?", "Tell me about Python", "Explain Python"]:
    response = ChatCompletion.create(
        model="openai/gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    # First: API call; subsequent: cache hits
    print(response.choices[0].message["content"])
# Check how much you saved
stats = onellm.cache_stats()
print(f"Saved {stats['hits']} API calls!")
Production API Service
from flask import Flask, request, jsonify
import onellm
from onellm import ChatCompletion

app = Flask(__name__)

# Initialize cache at startup
onellm.init_cache(max_entries=5000, p=0.93)

@app.route('/chat', methods=['POST'])
def chat():
    data = request.json
    response = ChatCompletion.create(
        model="openai/gpt-4",
        messages=data['messages']
    )
    return jsonify(response.choices[0].message)

if __name__ == '__main__':
    app.run()
A/B Testing Cache Thresholds
import onellm
from onellm import ChatCompletion

# Test different similarity thresholds
thresholds = [0.90, 0.93, 0.95, 0.98]

for threshold in thresholds:
    onellm.clear_cache()
    onellm.init_cache(p=threshold)

    # Run test queries
    queries = [...]  # Your test queries
    for query in queries:
        response = ChatCompletion.create(...)

    stats = onellm.cache_stats()
    hit_rate = stats['hits'] / (stats['hits'] + stats['misses'])
    print(f"Threshold {threshold}: {hit_rate:.1%} hit rate")
Monitoring Cache Performance
import onellm
import time
from onellm import ChatCompletion

onellm.init_cache()

# Wrapper to track timing and whether the call hit the cache
def timed_chat_completion(**kwargs):
    before = onellm.cache_stats()
    start = time.time()
    response = ChatCompletion.create(**kwargs)
    elapsed = time.time() - start
    after = onellm.cache_stats()
    cache_status = "HIT" if after['hits'] > before['hits'] else "MISS"
    print(f"{cache_status}: {elapsed*1000:.1f}ms")
    return response

# Use wrapper
response = timed_chat_completion(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
Troubleshooting
Suppressing Cache Warnings
The cache logger uses Python's standard logging module and can be configured to suppress warnings:
Suppress cache warnings only:
import logging
logging.getLogger("onellm.cache").setLevel(logging.ERROR)
Suppress all onellm logging:
import logging
logging.getLogger("onellm").setLevel(logging.ERROR)
Disable onellm logging completely:
import logging
logging.getLogger("onellm").addHandler(logging.NullHandler())
logging.getLogger("onellm").propagate = False
Filter specific messages:
import logging
class CacheWarningFilter(logging.Filter):
    def filter(self, record):
        return "Failed to add to semantic cache" not in record.getMessage()

logging.getLogger("onellm.cache").addFilter(CacheWarningFilter())
Note: Configure logging before calling onellm.init_cache() for best results.
Cache Not Working
Check if cache is initialized:
stats = onellm.cache_stats()
if stats['hits'] == 0 and stats['misses'] == 0:
    print("Cache not initialized. Call onellm.init_cache()")
Note on streaming:
# Streaming requests ARE cached and simulated naturally
# Both stream=True and stream=False benefit from cache
response = ChatCompletion.create(..., stream=True) # Will use cache
response = ChatCompletion.create(..., stream=False) # Will use cache
Slow First Query
The first query after init_cache() loads the embedding model (~13s):
- This is a one-time cost per process
- Subsequent queries are fast
- Consider loading cache in startup code
Low Hit Rate
Possible causes:
- Similarity threshold too high - try lowering it: init_cache(p=0.90)
- Queries are genuinely different - check query similarity
- Parameters changing - temperature, max_tokens, etc. affect the cache key
- Different models - each model has separate cache entries
Debug:
# Enable debug logging
import logging
logging.basicConfig(level=logging.DEBUG)
onellm.init_cache()
# Check what's being cached
stats = onellm.cache_stats()
print(f"Entries: {stats['entries']}, Hits: {stats['hits']}, Misses: {stats['misses']}")
Out of Memory
Cache uses ~2.5KB per entry. For 1000 entries:
- Hash cache: ~1MB
- Semantic index: ~1.5MB per 1000 vectors
- Total: ~2.5MB per 1000 entries
Solutions:
- Reduce max_entries: init_cache(max_entries=500)
- Use hash-only mode: init_cache(hash_only=True)
- Clear the cache periodically: onellm.clear_cache()
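A quick way to sanity-check memory use from the stats, using the rough ~2.5KB-per-entry figure above (an approximation, not a measured value):

import onellm

stats = onellm.cache_stats()
approx_mb = stats['entries'] * 2.5 / 1024   # ~2.5KB per entry
print(f"Cache entries: {stats['entries']} (~{approx_mb:.1f} MB)")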
API Reference
init_cache()
Initialize the global semantic cache.
onellm.init_cache(
    max_entries: int = 1000,
    similarity_threshold: float = 0.95,
    p: float | None = None,
    hash_only: bool = False,
    stream_chunk_strategy: str = "words",
    stream_chunk_length: int = 8,
    ttl: int = 86400
)
Parameters:
- max_entries - Maximum cache entries before LRU eviction (default: 1000)
- similarity_threshold - Minimum similarity score for semantic hits (default: 0.95)
- p - Shorthand for similarity_threshold (e.g., p=0.9)
- hash_only - Disable semantic matching, use only exact matches (default: False)
- stream_chunk_strategy - Chunking strategy for simulated streaming: words, sentences, paragraphs, or characters (default: "words")
- stream_chunk_length - Units per chunk when simulating streaming (default: 8)
- ttl - Entry time-to-live in seconds, refreshed on access (default: 86400)
cache_stats()
Get cache statistics.
stats = onellm.cache_stats()
# Returns: {'hits': int, 'misses': int, 'entries': int}
clear_cache()
Clear all cached entries.
onellm.clear_cache()
disable_cache()
Disable caching.
onellm.disable_cache()
Further Reading
- Architecture - Overall OneLLM architecture
- Advanced Features - Fallbacks, retries, and more
- Configuration - API keys and provider setup
- Example: cache_example.py - Complete working example
Support
If you encounter issues with caching:
- Check this documentation
- Review examples/cache_example.py
- Open an issue on GitHub