llama.cpp
CPU-first inference library in C/C++ enabling quantised LLM inference on consumer hardware without a GPU.
Definition
llama.cpp is an MIT-licensed inference runtime written in plain C/C++, started by Georgi Gerganov. It introduced GGUF, a model file format (the successor to the earlier GGML format) that stores quantised weights and model metadata in a single file, and it supports integer quantisation from roughly 1.5-bit to 8-bit. While primarily CPU-targeted, llama.cpp also provides GPU backends, including Apple Metal (M-series), CUDA, and Vulkan. It is widely treated as the de facto standard for running large language models locally on consumer laptops and workstations, and it has a rich surrounding ecosystem: language bindings (for example, Python via llama-cpp-python) and a bundled server mode that exposes an OpenAI-compatible API.
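To make the idea of low-bit quantisation concrete, the following is a minimal sketch of block-wise 4-bit quantisation in pure Python. It is a simplified illustration, not llama.cpp's actual code or the exact on-disk GGUF layout: the block size of 32, the symmetric scale rule, and the function names `quantize_block`, `dequantize_block`, `quantize`, and `dequantize` are all assumptions made for this example.

```python
import random

# Assumed block length for this sketch; each block shares one scale factor.
BLOCK_SIZE = 32


def quantize_block(values):
    """Map one block of floats to signed 4-bit integers plus a single scale."""
    max_abs = max(abs(v) for v in values)
    # Symmetric scale so the largest magnitude maps to +/-7.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Clamp to the signed 4-bit range [-8, 7].
    quants = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from a quantised block."""
    return [scale * q for q in quants]


def quantize(weights):
    """Split a flat list of weights into blocks and quantise each one."""
    return [
        quantize_block(weights[start:start + BLOCK_SIZE])
        for start in range(0, len(weights), BLOCK_SIZE)
    ]


def dequantize(blocks):
    """Concatenate the dequantised blocks back into a flat list."""
    out = []
    for scale, quants in blocks:
        out.extend(dequantize_block(scale, quants))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(64)]
    restored = dequantize(quantize(weights))
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"max abs reconstruction error: {worst:.4f}")
```

Under these assumptions, each weight costs 4 bits plus its share of the per-block scale. With a 16-bit scale per 32 weights that comes to 4.5 bits per weight, compared with 16 or 32 bits for unquantised weights. The cost is a bounded rounding error in each block. This sketch keeps the integers in a Python list and does not pack them into bytes.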
Related
More Software terms
PagedAttention
vLLM's technique for storing the KV cache in fixed-size, non-contiguous memory pages, nearly eliminating fragmentation and enabling larger effective batch sizes.
FlashAttention
IO-aware exact attention algorithm that tiles computation to stay in on-chip SRAM, cutting HBM reads/writes and speeding up attention by roughly 2–4× over standard implementations.
Continuous Batching
Scheduling technique that adds new requests to a running batch as soon as any sequence finishes, maximising GPU utilisation compared to static batching.