
Quantization Explained: How to Run 70B Models on Consumer GPUs



The 70B Problem

A full-precision 70B parameter language model stored in FP32 demands roughly 280GB of memory. Even in FP16, that drops to around 140GB of VRAM. The most powerful consumer GPU on the market, the NVIDIA RTX 4090, ships with 24GB. The math does not work, and a 4× A100 80GB node lists at roughly $60,000. This is where quantization comes in: reducing the numerical precision of a model's weights to shrink its memory footprint so it can run on hardware that would otherwise be completely out of reach. Quantization can turn that impossible 140GB requirement into something that fits on a single consumer card.

By the end of this article, the concepts behind GGUF quantization, EXL2, and related formats will be clear. Readers will know how to calculate VRAM requirements for running a 70B model on a given GPU, understand exactly how much quality they sacrifice at each precision level, and have working code to run inference on consumer hardware.

What Is Quantization and Why Does It Matter?

From FP32 to 4-Bit: A Precision Primer

Every parameter in a neural network is stored as a number. The format of that number determines both its precision and how much memory it consumes. Here is the breakdown:

| Precision | Bits per Param | Bytes per Param | 70B Model Size |
|-----------|----------------|-----------------|----------------|
| FP32      | 32             | 4.0             | ~280 GB        |
| FP16      | 16             | 2.0             | ~140 GB        |
| INT8      | 8              | 1.0             | ~70 GB         |
| INT4      | 4              | 0.5             | ~35 GB         |

The arithmetic is straightforward: 70 billion parameters multiplied by 0.5 bytes per parameter (4-bit) yields 35GB. This assumes uniform 4 bits per weight. Real-world k-quant formats like Q4_K_M use mixed precision, averaging ~4.5 bits per weight and yielding ~38–40GB for weights alone before KV cache overhead. That still represents a dramatic reduction from 280GB at FP32. But naive rounding of weights to lower precision destroys model quality. If you simply truncate every weight, accumulated error across billions of parameters collapses coherence. This is why sophisticated quantization algorithms exist. They minimize information loss during the conversion by calibrating against representative data, grouping weights intelligently, and distributing rounding errors in ways that preserve the model's learned behavior.
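That arithmetic can be checked in a few lines. This sketch uses decimal GB (10⁹ bytes), as the table above does; GiB-based figures come out roughly 7% smaller:

```python
def model_size_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage for `params` parameters at a given bits-per-weight."""
    # Decimal GB (1e9 bytes), matching the table; divide by 1024**3 for GiB.
    return params * bits_per_weight / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(70e9, bits):.0f} GB")

# Mixed-precision k-quants sit between the uniform rows:
print(f"Q4_K_M (~4.5 bpw): {model_size_gb(70e9, 4.5):.1f} GB")
```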

Post-Training Quantization vs. Quantization-Aware Training

Engineers quantize models in two ways. Quantization-Aware Training (QAT) incorporates reduced-precision simulation during the training process itself, allowing the model to adapt to lower precision. Post-Training Quantization (PTQ) reduces precision after training finishes and does not require retraining.

PTQ dominates the open-source ecosystem for a practical reason: retraining a 70B model is prohibitively expensive for most organizations, let alone individual developers. Foundational PTQ algorithms include GPTQ (which uses approximate second-order information to minimize layer-wise quantization error) and AWQ (Activation-Aware Weight Quantization, which preserves weights that correspond to large activation magnitudes). Both produce quantized models that can be distributed and used immediately without any training infrastructure.
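As a point of reference for what these algorithms improve on, here is a minimal sketch (assuming NumPy) of naive round-to-nearest PTQ with a single per-tensor scale — the baseline whose accumulated error the previous section warns about. GPTQ and AWQ add calibration data, per-group scales, and error compensation on top of this idea:

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int = 4):
    """Naive symmetric round-to-nearest quantization of one weight tensor."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for INT4 (range -8..7)
    scale = np.abs(w).max() / qmax      # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # typical weight magnitudes
q, scale = quantize_rtn(w, bits=4)
err = np.abs(dequantize(q, scale) - w).mean()
print(f"mean abs round-trip error: {err:.6f}")
```

Per-element the error is bounded by half the scale, but summed across billions of parameters and many layers it compounds, which is why production quantizers work per-group rather than per-tensor.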

Quantization Formats Compared: GGML, GGUF, EXL2, and AWQ

GGML (Legacy)

GGML originated as Georgi Gerganov's tensor library for efficient CPU-based inference. It served as the original format for llama.cpp quantized models. GGML is now deprecated (GGUF became the default format in llama.cpp during mid-2023), and there is no reason to use it unless working with legacy tooling that has not been updated.

GGUF: The CPU/GPU Hybrid Standard

GGUF is the single-file successor to GGML and the native format for llama.cpp. Its defining advantage is CPU offloading: llama.cpp splits layers between GPU VRAM and system RAM, allowing models that exceed available VRAM to still run at reduced speed. GGUF supports a range of quantization levels, each offering a different quality/size tradeoff: Q2_K, Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, and Q8_0. The "K" variants use k-quant methods that apply different bit-widths to different tensor types within each layer. Q4_K_M, for instance, averages about 4.5 bits per weight rather than a uniform 4 bits, because it allocates higher precision to more sensitive tensors.

If your VRAM cannot hold the full model and you are willing to trade inference speed for accessibility, GGUF is the right format.

EXL2: Maximum GPU Performance

EXL2 is ExLlamaV2's quantization format. Its standout feature is variable bits-per-weight across layers. Rather than applying uniform 4-bit quantization everywhere, EXL2 allocates more bits to layers that are sensitive to precision loss and fewer bits to layers that tolerate it. This per-layer optimization produces better quality at equivalent average bit-widths.

When the entire model fits in GPU VRAM and speed is the priority, EXL2 is the strongest option.

GPTQ and AWQ (Brief Mentions)

GPTQ is a mature GPU-only format with wide support through AutoGPTQ and integration with Hugging Face Transformers. AWQ offers slightly better quality than GPTQ at the same bit-width by prioritizing weights tied to large activations. Both remain viable, but advanced users increasingly prefer EXL2 for its per-layer flexibility and speed.

| Feature | GGUF | EXL2 | GPTQ | AWQ |
|---|---|---|---|---|
| Bits-per-weight options | Q2_K through Q8_0 | Variable (e.g., 3.0–6.0 bpw) | 2, 3, 4, 8-bit | Primarily 4-bit |
| CPU offload support | Yes | No | No | No |
| Relative speed (GPU-only) | Moderate | Fastest | Fast | Fast |
| Primary tooling | llama.cpp | ExLlamaV2 | AutoGPTQ | AutoAWQ |
| Best use case | Limited VRAM, hybrid CPU/GPU | Full VRAM fit, max speed | Broad compatibility | Quality at 4-bit |

Calculating VRAM Requirements

The VRAM Formula

The base formula for estimating VRAM consumption is:

VRAM (GB) ≈ (Parameters × Bits per Weight) / 8 / 1,073,741,824 + Context Overhead

The context overhead comes primarily from the KV cache, which scales with context length, number of layers, head dimensions, and the number of key-value heads. For a 70B model like Llama 3 70B (80 layers, GQA with 8 KV heads, head dimension 128), the KV cache at a 4096-token context consumes roughly 1.25 to 2GB (assuming FP16 KV cache; the lower bound reflects the raw KV tensor size, while overhead from memory allocation and fragmentation pushes the practical figure higher). At 8K context, double that estimate. At 32K, expect 8 to 10GB of overhead, which can dominate the VRAM budget at aggressive quantization levels.
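Plugging the Llama 3 70B shapes into that formula confirms the lower bound:

```python
# Raw FP16 KV cache for Llama 3 70B at a 4096-token context:
# 2 tensors (K and V) x layers x kv_heads x head_dim x context x bytes per element
layers, kv_heads, head_dim, ctx_len, fp16_bytes = 80, 8, 128, 4096, 2
kv_bytes = 2 * layers * kv_heads * head_dim * ctx_len * fp16_bytes
print(f"KV cache: {kv_bytes / 1024**3:.2f} GiB")  # 1.25 GiB, the lower bound above
```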

Practical VRAM Table for 70B Models

The raw arithmetic sometimes produces fractional values; the estimates below are rounded up to the nearest GB to reflect practical allocation behavior.

| Quantization Level | Est. VRAM (4K context) | Fits 24GB GPU? | Fits with CPU Offload? |
|---|---|---|---|
| Q4_K_M (GGUF, ~4.5 bpw) | ~38 GB | No | Yes |
| Q5_K_M (GGUF) | ~46 GB | No | Yes |
| Q6_K (GGUF) | ~54 GB | No | Yes |
| EXL2 4.0 bpw | ~37 GB | No | N/A (no offload) |
| EXL2 3.0 bpw | ~28 GB¹ | Tight | N/A |

¹ The 3.0 bpw weight calculation yields ~24.5GB for weights plus ~1.3GB KV cache ≈ ~25.8GB. The ~28GB figure includes about 2GB of CUDA runtime and memory allocator overhead typical during inference.

For a 70B model, even 4-bit quantization does not fit entirely in 24GB of VRAM. This is precisely why GGUF's CPU offloading capability matters: the layers that spill beyond VRAM are served from system RAM at the cost of slower inference.

Interactive VRAM Calculator

The following Python function implements the core VRAM estimation logic:

def estimate_vram_gb(num_params_b, bits_per_weight, context_length,
                     num_layers, head_dim, num_kv_heads, bytes_per_element=2):
    """
    Estimate total VRAM usage for a quantized model with KV cache.

    Returns GiB (1024-based). Reported as 'GB' per common convention in this domain.

    bytes_per_element: 2 for FP16 KV cache (default), 4 for FP32 KV cache.
    """
    if num_params_b <= 0 or bits_per_weight <= 0 or context_length <= 0:
        raise ValueError("num_params_b, bits_per_weight, and context_length must be positive.")
    if num_layers <= 0 or head_dim <= 0 or num_kv_heads <= 0:
        raise ValueError("num_layers, head_dim, and num_kv_heads must be positive.")

    # Model weights: convert B params to bytes, then to GiB
    weight_gb = (num_params_b * 1e9 * bits_per_weight) / 8 / (1024**3)

    # KV cache: 2 tensors (key + value) × layers × kv_heads × head_dim × context × bytes
    kv_cache_gb = (2 * num_layers * num_kv_heads * head_dim *
                   context_length * bytes_per_element) / (1024**3)

    return weight_gb + kv_cache_gb  # Caller rounds for display


# Llama 3 70B at Q4_K_M (~4.5 bpw mixed precision) with 4096 context
# Q4_K_M uses mixed bit-widths averaging ~4.5 bpw, not uniform 4.0
vram = estimate_vram_gb(70, 4.5, 4096, 80, 128, 8)
print(f"Estimated VRAM: {vram:.2f} GB")

For an embeddable web tool, the same logic translates directly to JavaScript:

function estimateVRAM(paramsB, bpw, ctxLen, layers, headDim, kvHeads, bytesPerElement = 2) {
  // bytesPerElement: 2 = FP16 KV cache (default), 4 = FP32 KV cache
  if (paramsB <= 0 || bpw <= 0 || ctxLen <= 0) {
    throw new RangeError("paramsB, bpw, and ctxLen must be positive.");
  }

  const weightGB = (paramsB * 1e9 * bpw) / 8 / Math.pow(1024, 3);

  const kvCacheGB = (2 * layers * kvHeads * headDim * ctxLen * bytesPerElement)
                    / Math.pow(1024, 3);

  return (weightGB + kvCacheGB).toFixed(2);
}

// FP16 KV cache (default): estimateVRAM(70, 4.5, 4096, 80, 128, 8)
// FP32 KV cache:           estimateVRAM(70, 4.5, 4096, 80, 128, 8, 4)

A production widget would pair this with a dropdown for common GPU VRAM capacities, displaying a fit/no-fit indicator and a recommendation for the highest quality quantization level that fits the selected hardware.
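A sketch of that recommendation logic, driven by the estimates from the VRAM table above (the headroom value and the GPU list are illustrative assumptions):

```python
# Estimated total VRAM at 4K context for a 70B model, from the table above (GB).
QUANT_VRAM_70B = {"Q4_K_M": 38, "Q5_K_M": 46, "Q6_K": 54}

def best_fit(vram_gb: float, headroom_gb: float = 1.0):
    """Highest-quality GGUF level that fits fully in VRAM, or None if offload is needed."""
    fitting = [(v, q) for q, v in QUANT_VRAM_70B.items() if v + headroom_gb <= vram_gb]
    return max(fitting)[1] if fitting else None

for gpu, vram in [("RTX 4090", 24), ("2x RTX 3090", 48), ("A100 80GB", 80)]:
    rec = best_fit(vram) or "none -- use CPU offload"
    print(f"{gpu} ({vram} GB): {rec}")
```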

Running a 70B GGUF Model with llama.cpp

Prerequisites

  • OS: The commands below are written for Linux. macOS and Windows users will need to adjust build flags (see platform notes inline).
  • CUDA toolkit: Version 11.8 or 12.x, with a compatible NVIDIA GPU driver.
  • CMake: Version 3.14 or higher.
  • Disk space: ~38–40GB free for the Q4_K_M GGUF model file.
  • System RAM: If offloading layers to CPU, ensure sufficient system RAM for the offloaded portion. For example, offloading 40 of 80 layers of a ~38GB model requires roughly 19GB of system RAM. A minimum of 32GB system RAM is recommended for comfortable operation.
  • Python 3.8+ (for huggingface-cli).

Setup and Launching with Layer Splitting

Build llama.cpp with CUDA support, download a GGUF model, and run inference with partial GPU offloading:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON

# Detect available CPU threads; fall back to 4 only if both tools are absent
if JOBS=$(nproc 2>/dev/null); then
  : # Linux: nproc succeeded
elif JOBS=$(sysctl -n hw.logicalcpu 2>/dev/null); then
  : # macOS: sysctl succeeded
else
  JOBS=4
  echo "[WARN] Could not detect CPU count; defaulting to -j${JOBS}"
fi

# On Windows, omit the -j flag or specify a thread count manually, e.g., -j 8.
cmake --build build --config Release -j"${JOBS}"

pip install huggingface-hub

# Verify the repository and filename exist at huggingface.co before running.
# Model hosting repos change over time; confirm availability before downloading.
huggingface-cli download bartowski/Meta-Llama-3-70B-Instruct-GGUF \
  Meta-Llama-3-70B-Instruct-Q4_K_M.gguf --local-dir ./models

./build/bin/llama-cli \
  -m ./models/Meta-Llama-3-70B-Instruct-Q4_K_M.gguf \
  --n-gpu-layers 40 \
  --ctx-size 4096 \
  -p "Explain the tradeoffs of model quantization in three paragraphs."

The --n-gpu-layers 40 flag offloads 40 of the model's 80 layers to the GPU, with the remaining layers processed in system RAM. Determining the right number is empirical: start with a conservative value, monitor VRAM usage with nvidia-smi, and increase until VRAM is nearly full. Leave about 1 to 2GB headroom for the KV cache and CUDA overhead.
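A rough starting point for the flag can be computed by assuming layers are uniform in size (they are not exactly — embedding and output tensors differ — so treat this as a first guess to refine against nvidia-smi):

```python
def max_gpu_layers(free_vram_gb: float, model_size_gb: float,
                   total_layers: int = 80, headroom_gb: float = 2.0) -> int:
    """Rough starting value for --n-gpu-layers, assuming uniform layer sizes."""
    per_layer_gb = model_size_gb / total_layers
    n = int((free_vram_gb - headroom_gb) / per_layer_gb)
    return max(0, min(n, total_layers))

# RTX 4090: ~24 GB VRAM, ~38 GB Q4_K_M model, 80 layers
print(max_gpu_layers(24, 38))  # 46 -- then tune up or down while watching nvidia-smi
```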

Measuring Inference Speed

llama.cpp prints timing statistics after generation completes:

llama_print_timings:        eval time =   17985.61 ms /   150 runs   (  119.90 ms per token,     8.34 tokens per second)

On an RTX 4090 with 40 of 80 layers offloaded (Ryzen 7 CPU, DDR5-5600, PCIe 4.0), a Q4_K_M 70B model achieves 8 to 12 tokens per second. Actual speed depends on CPU model, system RAM bandwidth, and PCIe generation. Offloading fewer layers reduces speed substantially because CPU-bound layers become the bottleneck.

Running a 70B EXL2 Model with ExLlamaV2

Prerequisites

  • CUDA toolkit 11.8 or higher.
  • Python 3.8+.
  • GPU VRAM must exceed the model's estimated footprint. Use nvidia-smi to check free VRAM before loading. If VRAM is insufficient, load_autosplit will crash with an out-of-memory error.

Setup and Inference

Install ExLlamaV2 and its dependencies first:

pip install exllamav2

Refer to the ExLlamaV2 releases page if you need a wheel built for a specific CUDA version.

Download a pre-quantized EXL2 model. The --revision flag specifies a Git branch or tag within the Hugging Face repository (not a filename filter). Verify the branch exists by checking the repo page or running huggingface-cli repo info before downloading:

# --revision is a Git ref (branch/tag). Confirm the branch "4.0bpw" exists in the repo
# before running. If the branch does not exist, this will fail with RevisionNotFoundError.
huggingface-cli download turboderp/Llama-3-70B-Instruct-exl2 \
  --revision 4.0bpw --local-dir ./models/llama-3-70b-exl2-4.0bpw

Alternatively, quantize from a base model using ExLlamaV2's convert.py script.

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache
# If ImportError on the next line, try: from exllamav2.tokenizer import ExLlamaV2Tokenizer
from exllamav2 import ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler
import time
import sys

config = ExLlamaV2Config("./models/llama-3-70b-exl2-4.0bpw")
config.max_seq_len = 4096

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)

try:
    model.load_autosplit(cache)
except Exception as e:
    print(f"[ERROR] Model load failed (OOM or path error): {e}")
    sys.exit(1)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7

prompt = "Explain the tradeoffs of model quantization."

# Note: generate_simple includes tokenization and prefill of the prompt,
# so the reported speed is end-to-end (prefill + generation), not generation-only.
start = time.time()

# generate_simple returns the full text (prompt + completion) as a single string.
output = generator.generate_simple(prompt, settings, num_tokens=150)

elapsed = time.time() - start

# Approximate the generated-token count by re-tokenizing the completion;
# the model may stop early at an EOS token, so this can be below 150.
completion = output[len(prompt):]
gen_tokens = tokenizer.encode(completion).shape[-1]

print(f"Speed: {gen_tokens / elapsed:.1f} tokens/s  ({gen_tokens} tokens in {elapsed:.2f}s)")
print(output)

model.unload()

Performance Comparison vs. GGUF

When a model fits entirely in VRAM, EXL2 runs 2 to 3 times faster than GGUF at equivalent quality (based on GPU-only inference benchmarks on an RTX 4090, 4K context, batch-1, comparing 4.0 bpw EXL2 to Q4_K_M GGUF; exact ratios vary by GPU, context length, and batch size). EXL2 loses this speed advantage entirely when the model exceeds VRAM, because EXL2 has no CPU offloading capability (as of the current ExLlamaV2 release). For a 70B model on a single 24GB GPU, GGUF with partial offloading is the only viable path. EXL2 becomes practical for 70B models on dual-GPU setups or for smaller models (up to about 30B parameters at 4-bit) on a single 24GB card.

Measuring Quality Impact: How Much Do You Lose?

Perplexity as a Quality Metric

Perplexity measures how well a model predicts a held-out test set; lower values indicate better prediction. The following values are drawn from llama.cpp perplexity evaluations of Llama 3 70B on wikitext-2-raw-v1 (see the llama.cpp perplexity discussion on GitHub for methodology and additional runs). Results vary by evaluation harness, tokenizer, and setup, so treat these as illustrative rather than exact:

| Quantization | Perplexity (wiki2) | Relative Increase vs FP16 |
|---|---|---|
| FP16 | ~3.32 | Baseline |
| Q8_0 | ~3.33 | ~0.3% |
| Q6_K | ~3.34 | ~0.6% |
| Q5_K_M | ~3.35 | ~0.9% |
| Q4_K_M | ~3.37 | ~1.5% |
| Q3_K_M | ~3.46 | ~4.2% |
| Q2_K | ~4.20 | ~26.5% |

For most use cases, Q4_K_M and Q5_K_M offer a practical balance, with perplexity increases below 2% relative to FP16. Below Q3_K_M, degradation becomes noticeable in complex reasoning and multi-step logic tasks, even when perplexity numbers still look superficially reasonable. At Q2_K, models exhibit clear loss of coherence in longer outputs. The general guidance: 4-bit is the floor for production-quality output.
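For readers reproducing these numbers, perplexity is just the exponential of the mean negative log-likelihood over held-out tokens, and the table's relative increases follow directly:

```python
import math

def perplexity(nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood) over held-out tokens."""
    return math.exp(sum(nlls) / len(nlls))

def relative_increase(quant_ppl: float, base_ppl: float) -> float:
    """Percentage perplexity increase of a quantized model over its baseline."""
    return 100 * (quant_ppl - base_ppl) / base_ppl

# Q4_K_M vs FP16, using the table's approximate values
print(f"{relative_increase(3.37, 3.32):.1f}%")  # 1.5%
```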

Decision Framework: Choosing the Right Setup

If the quantized model fits entirely in GPU VRAM, use EXL2 for maximum speed. If the model exceeds VRAM but fits in VRAM plus system RAM, use GGUF with layer offloading. When VRAM is not the bottleneck, bit-width matters more than format: use the highest bit-width that fits (Q6_K or 6.0 bpw). When speed matters most and you can tolerate some quality loss, drop to the lowest acceptable bit-width in EXL2.

| GPU | VRAM | Largest 70B Config | Recommended Format | Expected Speed | Notes |
|---|---|---|---|---|---|
| RTX 3060 | 12 GB | Q4_K_M (heavy offload) | GGUF | ~2–4 tok/s | Requires ≥32GB system RAM |
| RTX 3090 | 24 GB | Q4_K_M (partial offload) | GGUF | ~6–10 tok/s | |
| RTX 4090 | 24 GB | Q4_K_M (partial offload) | GGUF | ~8–12 tok/s | |
| 2× RTX 3090 | 48 GB | EXL2 4.0bpw (full VRAM) | EXL2 | ~20–30 tok/s | Estimate; varies by ExLlamaV2 version and interconnect |
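The framework above reduces to a few comparisons (a sketch; sizes in GB, ignoring KV cache and runtime overhead for brevity):

```python
def choose_format(model_vram_gb: float, gpu_vram_gb: float, system_ram_gb: float) -> str:
    """Pick a format per the decision framework: EXL2 if the model fits in VRAM,
    GGUF with offload if it fits in VRAM plus system RAM, otherwise no fit."""
    if model_vram_gb <= gpu_vram_gb:
        return "EXL2"                          # full VRAM fit: fastest option
    if model_vram_gb <= gpu_vram_gb + system_ram_gb:
        return "GGUF (partial CPU offload)"    # slower, but it runs
    return "does not fit"

print(choose_format(38, 24, 64))   # Q4_K_M 70B on an RTX 4090 -> GGUF with offload
print(choose_format(37, 48, 64))   # EXL2 4.0 bpw on 2x RTX 3090 -> EXL2
```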

Key Takeaways

Quantization is not a hack. It is a well-understood engineering tradeoff with measurable costs. Q4_K_M in GGUF format is the pragmatic default for most consumer GPU setups running 70B models, and EXL2 takes the lead when VRAM permits full model loading. Start with the VRAM calculator above, then validate quality on the specific tasks that matter for your use case.