Software

llama.cpp

CPU-first inference library in C/C++ enabling quantised LLM inference on consumer hardware without a GPU.

Definition

llama.cpp is an MIT-licensed inference runtime written in C/C++ by Georgi Gerganov. The project introduced GGUF, a model file format that stores quantized weights, and it supports quantization levels from roughly 2-bit to 8-bit. Although it primarily targets CPUs, llama.cpp also has Metal (Apple silicon), CUDA, and Vulkan backends. It is the de facto standard for running large language models locally on consumer laptops and workstations, with an ecosystem that includes Python bindings (llama-cpp-python) and a built-in server that exposes an OpenAI-compatible API.
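To make the idea of block-wise quantization concrete, here is a minimal pure-Python sketch of an 8-bit scheme in the spirit of llama.cpp's Q8_0 type: weights are split into blocks of 32, and each block stores one scale plus 32 signed 8-bit codes. The function names and the list-based layout are illustrative assumptions, not the library's actual API or on-disk format; the real implementation is vectorized C and packs the scale as a 16-bit float.

```python
BLOCK_SIZE = 32  # assumed block length, matching the Q8_0 block size


def quantize_block(values):
    """Map one block of floats to (scale, int8 codes) using a shared scale."""
    amax = max(abs(v) for v in values)
    # One scale per block; fall back to 1.0 so an all-zero block is safe.
    scale = amax / 127.0 if amax > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from one quantized block."""
    return [scale * c for c in codes]


def quantize(weights):
    """Quantize a flat list whose length is a multiple of BLOCK_SIZE."""
    if len(weights) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of BLOCK_SIZE")
    return [
        quantize_block(weights[i:i + BLOCK_SIZE])
        for i in range(0, len(weights), BLOCK_SIZE)
    ]


def dequantize(blocks):
    """Concatenate the dequantized blocks back into one flat list."""
    restored = []
    for scale, codes in blocks:
        restored.extend(dequantize_block(scale, codes))
    return restored
```

Under these assumptions the rounding error of each weight is at most half of its block's scale. If the scale is stored as a 16-bit float, a block takes 34 bytes for 32 weights, or about 8.5 bits per weight. Lower-bit formats follow the same pattern with fewer bits per code and more elaborate scale layouts.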

Related: Quantization, AWQ

More Software terms

PagedAttention

vLLM's technique for storing KV cache in non-contiguous memory pages, eliminating fragmentation and enabling larger effective batch sizes.

FlashAttention

IO-aware exact attention algorithm that tiles the computation so intermediate results stay in on-chip SRAM, cutting HBM reads/writes and making attention roughly 2–4× faster than standard implementations.

Continuous Batching

Scheduling technique that adds new requests to a running batch as soon as any sequence finishes, maximising GPU utilisation compared to static batching.
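The scheduling difference behind continuous batching can be illustrated with a toy simulation. This is a sketch under simplifying assumptions: every sequence needs a known, positive number of decode steps, each step produces one token for every active sequence, and prefill cost is ignored. The function names are hypothetical and do not correspond to any serving framework's API.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Continuous batching: refill free slots before every decode step."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each active sequence
    steps = 0
    while waiting or running:
        # Admit queued requests into any slots freed by finished sequences.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Sequences with one step left finish now and release their slot.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [8, 2, 2, 2, 8, 2, 2, 2]  # decode steps needed per request
print("static:    ", static_batching_steps(lengths, batch_size=4))
print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

With this workload the static schedule needs 16 decode steps, because each batch waits for its 8-step sequence, while the continuous schedule needs 10, since the short requests free their slots early and queued requests start immediately.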
