llama.cpp Provider Implementation Plan

Based on the tutorial, here’s a simple approach for the llama.cpp provider:

Model Naming Convention

Support two formats:

  1. Full path: llama-cpp//Users/ran/models/llama-3-8b-q4_K_M.gguf
  2. Model name: llama-cpp/llama-3-8b-q4_K_M.gguf (searched in the configured model directory; see the resolution sketch below)
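
A minimal sketch of how this resolution could work, assuming a hypothetical resolve_model_path helper (the LLAMA_CPP_MODEL_DIR variable and ~/llama_models fallback match the defaults below):

import os
from pathlib import Path

def resolve_model_path(model: str) -> Path:
    """Resolve 'llama-cpp/<name-or-path>' to a .gguf file on disk (hypothetical helper)."""
    name = model.removeprefix("llama-cpp/")
    if name.startswith("/"):
        # Format 1: absolute path after the provider prefix
        path = Path(name)
    else:
        # Format 2: bare file name, searched in the configured model directory
        model_dir = os.environ.get("LLAMA_CPP_MODEL_DIR", Path.home() / "llama_models")
        path = Path(model_dir) / name
    if not path.is_file():
        raise FileNotFoundError(f"GGUF model not found: {path}")
    return path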

Default Configuration

# In config.py
"llama_cpp": {
    "model_dir": None,  # Defaults to ~/llama_models or LLAMA_CPP_MODEL_DIR
    "n_ctx": 2048,      # Context window
    "n_gpu_layers": 0,  # GPU layers (0 = CPU only)
    "n_threads": None,  # Auto-detect CPU cores
    "temperature": 0.7, # Default temperature
    "timeout": 300,     # 5 minutes for model loading
}

Environment Variables

LLAMA_CPP_MODEL_DIR=/path/to/models  # Default model directory
LLAMA_CPP_N_GPU_LAYERS=32            # GPU acceleration
LLAMA_CPP_N_CTX=2048                 # Context window
LLAMA_CPP_N_THREADS=8                # CPU threads
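
As a sketch, these variables could be merged into the defaults above at provider start-up; the load_llama_cpp_config helper below is illustrative, not existing code:

import os

def load_llama_cpp_config() -> dict:
    """Build provider settings, letting environment variables override the defaults (illustrative)."""
    n_threads = os.environ.get("LLAMA_CPP_N_THREADS")
    return {
        "model_dir": os.environ.get("LLAMA_CPP_MODEL_DIR"),  # None -> fall back to ~/llama_models
        "n_ctx": int(os.environ.get("LLAMA_CPP_N_CTX", 2048)),
        "n_gpu_layers": int(os.environ.get("LLAMA_CPP_N_GPU_LAYERS", 0)),
        "n_threads": int(n_threads) if n_threads else None,  # None -> auto-detect
        "temperature": 0.7,
        "timeout": 300,
    }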

Simple Usage

from onellm import Client

client = Client()  # the awaited calls below must run inside an async function (or an async REPL)

# Use model from default directory
response = await client.chat.completions.create(
    model="llama-cpp/llama-3-8b-instruct-q4_K_M.gguf",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Use full path
response = await client.chat.completions.create(
    model="llama-cpp//home/user/models/mixtral-8x7b-q4_K_M.gguf",
    messages=[{"role": "user", "content": "Hello!"}]
)

# With custom settings
response = await client.chat.completions.create(
    model="llama-cpp/llama-3-8b-instruct-q4_K_M.gguf",
    messages=[{"role": "user", "content": "Hello!"}],
    n_gpu_layers=32,  # Use GPU
    temperature=0.3,  # More focused
    max_tokens=500
)

Implementation Strategy

  1. Model Loading: Cache loaded models to avoid reloading (see the sketch after this list)
  2. Path Resolution: Check both full paths and the configured model directory
  3. Auto-detection: Detect CPU cores when n_threads is not set
  4. Error Messages: Clear install instructions when llama-cpp-python is not installed
  5. Memory Management: Unload models after an inactivity timeout
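
A rough sketch of points 1 and 3, using the Llama class from llama-cpp-python; the module-level cache and get_model helper are assumptions, not existing OneLLM code:

import os
from llama_cpp import Llama

_model_cache: dict[str, Llama] = {}  # keyed by resolved model path

def get_model(model_path: str, n_ctx: int = 2048, n_gpu_layers: int = 0, n_threads=None) -> Llama:
    """Return a cached Llama instance, loading it on first use (hypothetical helper)."""
    if model_path not in _model_cache:
        _model_cache[model_path] = Llama(
            model_path=model_path,
            n_ctx=n_ctx,
            n_gpu_layers=n_gpu_layers,
            n_threads=n_threads or os.cpu_count(),  # auto-detect CPU cores when unset
        )
    return _model_cache[model_path]

Inference can then go through the cached instance's create_chat_completion method, which llama-cpp-python provides and which accepts the messages, temperature, and max_tokens arguments used in the examples above.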

Installation Message

When llama-cpp-python is not installed:

The llama.cpp provider requires llama-cpp-python. Install it with:

# For CPU only:
pip install llama-cpp-python

# For GPU acceleration (Mac M1/M2/M3):
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

# For NVIDIA GPUs:
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python

See docs/llama_cpp_tutorial.md for detailed setup instructions.
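
The check behind this message could be as simple as the following sketch (the INSTALL_MESSAGE constant and its exact wording are placeholders):

INSTALL_MESSAGE = (
    "The llama.cpp provider requires llama-cpp-python. Install it with:\n"
    "  pip install llama-cpp-python\n"
    "See docs/llama_cpp_tutorial.md for detailed setup instructions."
)

try:
    from llama_cpp import Llama
except ImportError as exc:
    raise ImportError(INSTALL_MESSAGE) from exc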

Would you like me to implement the provider with this simple approach?


