llama.cpp Tutorial for OneLLM

What is llama.cpp?

llama.cpp is a lightweight, efficient engine for running Large Language Models (LLMs) locally on your computer. It's written in C++ for speed and has Python bindings (llama-cpp-python) that OneLLM can use. Think of it as a way to run AI models similar to ChatGPT, but completely offline, on your own machine!

Key Concepts Explained

1. GGUF Model Files

  • What they are: GGUF (GPT-Generated Unified Format) files pack a model’s weights, tokenizer, and metadata into a single file, usually in quantized (compressed) form
  • Why they matter: They’re much smaller than original models (e.g., 4GB instead of 20GB)
  • Where to get them: Download from Hugging Face
  • Naming convention: modelname-size-quantization.gguf (see the parsing sketch below)
    • Example: llama-3-8b-instruct-q4_K_M.gguf
    • 8b = 8 billion parameters (model size)
    • q4_K_M = quantization level (compression type)
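
If you ever want to sanity-check a filename programmatically, here is a minimal sketch; parse_gguf_name is an illustrative helper written for this tutorial, not part of OneLLM or llama.cpp:

import re

# Illustrative helper: split "modelname-size-quantization.gguf" into parts
def parse_gguf_name(filename):
    stem = filename[: -len(".gguf")]
    parts = stem.split("-")
    quant = parts[-1]  # quantization suffix, e.g. "q4_K_M"
    size = next((p for p in parts if re.fullmatch(r"\d+(\.\d+)?b", p, re.IGNORECASE)), None)
    return {"name": "-".join(parts[:-1]), "size": size, "quant": quant}

print(parse_gguf_name("llama-3-8b-instruct-q4_K_M.gguf"))
# {'name': 'llama-3-8b-instruct', 'size': '8b', 'quant': 'q4_K_M'}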

2. Quantization Levels

Think of quantization like image compression: you trade some quality for a much smaller file size.

| Level  | Size    | Quality   | Use Case                   |
|--------|---------|-----------|----------------------------|
| Q8_0   | Largest | Best      | When quality matters most  |
| Q5_K_M | Large   | Very good | Good balance               |
| Q4_K_M | Medium  | Good      | Recommended for most users |
| Q3_K_M | Small   | Okay      | When space/RAM is limited  |
| Q2_K   | Tiny    | Lower     | Emergency only             |

Recommendation: Start with Q4_K_M; it’s the sweet spot!
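
To get a feel for these sizes, you can estimate a GGUF file from the parameter count and the quantization’s bits per weight. The bits-per-weight figures below are rough approximations, not exact values:

# Rough file-size estimate: parameters x bits-per-weight / 8 bits per byte.
# The bpw values are approximate averages for each quantization level.
APPROX_BPW = {"Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 4.0, "Q2_K": 2.6}

params = 8e9  # an 8-billion-parameter model
for level, bpw in APPROX_BPW.items():
    print(f"{level}: ~{params * bpw / 8 / 1e9:.1f} GB")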

3. Hardware Considerations

CPU vs GPU

  • CPU Only: Works on any computer, but slower
  • GPU Acceleration: Much faster, but requires compatible GPU
    • NVIDIA GPUs: Use CUDA
    • Mac M1/M2/M3: Use Metal
    • AMD GPUs: Use ROCm (less common)

RAM Requirements

  • 8B models: Need ~6-8GB RAM (at Q4_K_M)
  • 13B models: Need ~10-12GB RAM (at Q4_K_M)
  • 70B models: Need ~40-50GB RAM (at Q4_K_M)
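
These figures follow a simple rule of thumb: RAM needed ≈ model file size plus runtime overhead for the context window and buffers. A rough sketch (the 1.5GB overhead is an assumption, and it grows with n_ctx):

# Rule of thumb: model file size plus overhead for context and buffers
def approx_ram_gb(file_size_gb, overhead_gb=1.5):
    return file_size_gb + overhead_gb

print(approx_ram_gb(4.9))  # an 8B Q4_K_M file -> roughly 6.4 GB of RAM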

4. Context Window

  • What it is: How much text the model can “remember” in a conversation
  • Default: Usually 2048 tokens (~1500 words)
  • Larger context: Uses more RAM but remembers more
  • Recommendation: Start with 2048, increase if needed
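
To see how much of the window a prompt actually consumes, tokenize it with a loaded model (here llm is assumed to be the llama_cpp.Llama instance from Step 4 below):

# Count tokens to check how much of n_ctx a prompt uses
prompt = "Summarize the following article in three bullet points: ..."
tokens = llm.tokenize(prompt.encode("utf-8"))
print(f"{len(tokens)} tokens used out of {llm.n_ctx()} available")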

Step-by-Step Setup Guide

Step 1: Install llama-cpp-python

For Mac Users (M1/M2/M3):

# Install with Metal support for GPU acceleration
# (on newer llama-cpp-python releases the flag is -DGGML_METAL=on)
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

For Windows/Linux with NVIDIA GPU:

# Install with CUDA support
# (on newer llama-cpp-python releases the flag is -DGGML_CUDA=on)
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python

For CPU only (any system):

# Basic installation
pip install llama-cpp-python
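
Whichever route you take, a quick import confirms the install worked:

# Verify that the bindings import cleanly and print their version
python -c "import llama_cpp; print(llama_cpp.__version__)"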

Step 2: Download a Model

OneLLM includes a built-in command to download models:

# Download Llama 3 8B (recommended starter model)
onellm download -r shinkeonkim/Meta-Llama-3-8B-Instruct-Q4_K_M-GGUF \
                -f meta-llama-3-8b-instruct-q4_k_m.gguf

# Download Phi-3 Mini (smaller, faster)
onellm download -r microsoft/Phi-3-mini-4k-instruct-gguf \
                -f Phi-3-mini-4k-instruct-q4.gguf

The models will be saved to ~/llama_models by default.

Manual Download

Alternatively, you can manually download from Hugging Face:

  1. Go to Hugging Face and search for GGUF models
  2. Recommended starter models: Llama 3 8B Instruct or Phi-3 Mini (the same models shown in Step 2)
  3. Download the Q4_K_M version (good balance of size/quality)

Step 3: Organize Your Models

Create a folder for your models:

# Create a models directory in your home folder
mkdir ~/llama_models

# Move your downloaded model there
mv ~/Downloads/llama-3-8b-instruct-q4_K_M.gguf ~/llama_models/

Step 4: Test Your Setup

from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="/Users/yourname/llama_models/llama-3-8b-instruct-q4_K_M.gguf",
    n_ctx=2048,  # Context window
    n_gpu_layers=-1  # Offload all layers to the GPU (set to 0 for CPU only)
)

# Test it
response = llm("Hello! Can you explain what you are?", max_tokens=100)
print(response["choices"][0]["text"])
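
For multi-turn conversations, llama-cpp-python also provides create_chat_completion, which applies the model’s chat template for you:

# Chat-style API: messages are formatted with the model's chat template
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain GGUF in one sentence."},
    ],
    max_tokens=100,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])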

Configuration Options Explained

Basic Settings

llm = Llama(
    model_path="path/to/model.gguf",  # Path to your model file
    n_ctx=2048,        # Context window (how much text to remember)
    n_threads=8,       # CPU threads (set to your CPU core count)
    n_gpu_layers=32,   # GPU layers (0 = CPU only, higher = more GPU)
)

# Sampling settings like temperature are passed per request, not to the constructor
response = llm("Your prompt here", max_tokens=100, temperature=0.7)  # 0 = focused, 1 = creative

What Each Setting Does

  1. n_ctx (Context Window)
    • Default: 2048
    • Higher = remembers more conversation
    • But uses more RAM
    • Try: 2048, 4096, or 8192
  2. n_threads (CPU Threads)
    • Default: 4
    • Set to number of CPU cores
    • Mac: sysctl -n hw.ncpu
    • Windows/Linux: Check Task Manager/System Monitor
  3. n_gpu_layers (GPU Acceleration)
    • 0 = CPU only (slower but works everywhere)
    • -1 = offload all layers to the GPU
    • 1+ = how many model layers to offload to the GPU
    • Start with 32, adjust based on GPU memory
    • If you get memory errors, reduce this number
  4. temperature (Creativity; set per generation call, not in the constructor)
    • 0.1 = Very focused, factual
    • 0.7 = Balanced (recommended)
    • 1.0 = Very creative, more random
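
Rather than hard-coding n_threads, you can detect the core count at runtime; a minimal sketch:

import os
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/model.gguf",
    n_ctx=2048,
    n_threads=os.cpu_count() or 4,  # logical core count, with a fallback
    n_gpu_layers=0,                 # raise once GPU support is confirmed
)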

Recommended Setup

1. Directory Structure

~/llama_models/
├── general/
│   └── llama-3-8b-instruct-q4_K_M.gguf
├── code/
│   └── codellama-13b-instruct-q4_K_M.gguf
└── creative/
    └── mixtral-8x7b-instruct-q4_K_M.gguf

2. Environment Variables

Add to your .bashrc/.zshrc:

# Default model directory
export LLAMA_CPP_MODEL_DIR="$HOME/llama_models"

# Hardware settings (adjust based on your system)
export LLAMA_CPP_N_GPU_LAYERS=32  # 0 for CPU only
export LLAMA_CPP_N_CTX=2048        # Context window
export LLAMA_CPP_N_THREADS=8       # Your CPU core count
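
Note that these are plain environment variables following this guide’s naming; one way to consume them is to read them yourself when constructing the model, as in this sketch:

import os
from llama_cpp import Llama

# Read the variables defined above, with safe fallbacks
model_dir = os.environ.get("LLAMA_CPP_MODEL_DIR", os.path.expanduser("~/llama_models"))
llm = Llama(
    model_path=os.path.join(model_dir, "llama-3-8b-instruct-q4_K_M.gguf"),
    n_ctx=int(os.environ.get("LLAMA_CPP_N_CTX", 2048)),
    n_threads=int(os.environ.get("LLAMA_CPP_N_THREADS", 4)),
    n_gpu_layers=int(os.environ.get("LLAMA_CPP_N_GPU_LAYERS", 0)),
)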

3. Model Naming for OneLLM

When llama.cpp is integrated with OneLLM, you’ll use models like:

# Full path approach
model="llama-cpp//Users/yourname/llama_models/llama-3-8b-instruct-q4_K_M.gguf"

# Or with configured model directory
model="llama-cpp/llama-3-8b-instruct-q4_K_M.gguf"

Performance Tips

1. Choosing the Right Model Size

  • 4-8GB RAM: Use 3B-7B models
  • 16GB RAM: Use 8B-13B models
  • 32GB+ RAM: Can handle 30B+ models

2. Speed Optimization

  • Use GPU layers (n_gpu_layers) if you have a GPU
  • Use quantized models (Q4_K_M or Q5_K_M)
  • Reduce context window if you don’t need long conversations
  • Close other applications to free up RAM
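
To check whether a tweak actually helps, measure tokens per second (assuming llm is already loaded as in Step 4):

import time

# Time a generation and compute throughput from the returned usage stats
start = time.perf_counter()
response = llm("Write a haiku about local LLMs.", max_tokens=128)
elapsed = time.perf_counter() - start

generated = response["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")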

3. Quality vs Speed Trade-offs

  • Need Speed? Use smaller models (3B-7B) with Q4_K_M
  • Need Quality? Use larger models (13B+) with Q5_K_M or higher
  • Best of Both? Use 8B models with Q4_K_M

Troubleshooting

“Out of Memory” Error

Solution:

  • Use a smaller model
  • Reduce n_gpu_layers
  • Use more aggressive quantization (Q3_K_M)

Slow Performance

Solution:

  • Enable GPU acceleration
  • Use a smaller model
  • Reduce context window
  • Check CPU usage

Installation Fails

Solution:

  • Update pip: pip install --upgrade pip
  • Install build tools:
    • Mac: xcode-select --install
    • Windows: Install Visual Studio Build Tools
    • Linux: sudo apt-get install build-essential

Next Steps

  1. Start Small: Begin with a 7B parameter model
  2. Experiment: Try different models for different tasks
  3. Join Community: llama.cpp Discord for help
  4. Explore Models: Browse Hugging Face for more models

Quick Reference

# For most users
model_path = "~/llama_models/llama-3-8b-instruct-q4_K_M.gguf"  # expand "~" with os.path.expanduser() in Python
n_gpu_layers = 32  # or 0 for CPU only
n_ctx = 2048
temperature = 0.7  # passed per generation call

Model Selection Guide

  • General Chat: Llama 3 8B
  • Coding: CodeLlama 13B
  • Creative Writing: Mixtral 8x7B
  • Fast Responses: Phi-3 Mini
  • Best Quality: Llama 3 70B (needs lots of RAM!)
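
If you keep several models around, a small mapping makes switching easy; the paths below are examples that mirror the directory layout shown earlier:

import os

# Example task -> model file mapping, mirroring the earlier directory layout
MODELS = {
    "general":  "general/llama-3-8b-instruct-q4_K_M.gguf",
    "code":     "code/codellama-13b-instruct-q4_K_M.gguf",
    "creative": "creative/mixtral-8x7b-instruct-q4_K_M.gguf",
}

model_path = os.path.join(os.path.expanduser("~/llama_models"), MODELS["general"])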

That’s it! You’re ready to run LLMs locally with llama.cpp! 🚀

