Software

llama.cpp

CPU-first inference library in C/C++ that enables quantized LLM inference on consumer hardware without a GPU.

Definition

llama.cpp is an MIT-licensed inference runtime written in C/C++, created by Georgi Gerganov. It introduced GGUF, a single-file model format that stores weights and metadata together, and it supports integer-quantized models from 2-bit to 8-bit. While primarily CPU-targeted, llama.cpp also provides GPU backends for Apple Metal (M-series), CUDA, and Vulkan. It is widely regarded as the de facto standard for running large language models locally on consumer laptops and workstations. Its ecosystem includes language bindings (for example, Python via llama-cpp-python) and a built-in server mode that exposes an OpenAI-compatible API.
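To make the idea of block quantization concrete, the following is a minimal Python sketch, not llama.cpp's actual implementation. It assumes a simplified 8-bit scheme in the spirit of ggml's Q8_0 type: weights are split into blocks of 32, and each block stores one scale plus one small integer per weight. The function names (`quantize_block`, `dequantize_block`, `quantize`, `bits_per_weight`) are hypothetical and chosen only for illustration.

```python
# Simplified 8-bit block quantization sketch (illustrative only).
# Assumption: blocks of 32 weights with one scale per block, loosely
# modelled on ggml's Q8_0 layout. All names here are hypothetical.

BLOCK_SIZE = 32  # weights per block


def quantize_block(values):
    """Map one block of floats to (scale, integers in [-127, 127])."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        # An all-zero block needs no scale.
        return 0.0, [0] * len(values)
    scale = amax / 127.0
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from a quantized block."""
    return [scale * q for q in quants]


def quantize(weights):
    """Quantize a flat list of weights block by block."""
    if len(weights) % BLOCK_SIZE != 0:
        raise ValueError("weight count must be a multiple of BLOCK_SIZE")
    return [
        quantize_block(weights[i:i + BLOCK_SIZE])
        for i in range(0, len(weights), BLOCK_SIZE)
    ]


def bits_per_weight(block_size=BLOCK_SIZE, quant_bits=8, scale_bits=16):
    """Average storage cost per weight, including the per-block scale."""
    return (block_size * quant_bits + scale_bits) / block_size


if __name__ == "__main__":
    weights = [(i - 32) / 10.0 for i in range(64)]
    blocks = quantize(weights)
    restored = [x for scale, q in blocks for x in dequantize_block(scale, q)]
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print("blocks:", len(blocks))
    print("worst absolute error:", worst)
    print("bits per weight:", bits_per_weight())
```

Under these assumptions, the per-block scale adds a small overhead on top of the nominal bit width (8.5 bits per weight for 8-bit values with a 16-bit scale). Lower-bit formats trade more reconstruction error for a smaller memory footprint, which is what makes large models fit in the RAM of a laptop or workstation.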
